I am trying to learn how BeautifulSoup works in order to create an application.
I am able to find and print all elements with .find_all(), but the output includes the HTML tags as well. How can I print ONLY the text within these tags?
This is what I have:
from bs4 import BeautifulSoup
"""<html>
<p>1</p>
<p>2</p>
<p>3</p>
"""
soup = BeautifulSoup(open('index.html'), "html.parser")
i = soup.find_all('p')
print(i)
This may help you:
from bs4 import BeautifulSoup
source_code = """<html>
<p>1</p>
<p>2</p>
<p>3</p>
"""
soup = BeautifulSoup(source_code, "html.parser")
print(soup.text)
Output:
1
2
3
soup = BeautifulSoup(open('index.html'), "html.parser")
i = soup.find_all('p')
for p in i:
    print(p.text)
find_all() returns a list of tags; iterate over it and use tag.text to get the text inside each tag.
A more compact way:
for p in soup.find_all('p'):
    print(p.text)
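Here is a minimal, self-contained sketch of the same idea. It assumes the HTML is held in a string rather than read from index.html, and it collects the text into a list (the name texts is just for illustration) instead of printing inside the loop:

```python
from bs4 import BeautifulSoup

# Inline sample document (an assumption; the question reads index.html instead).
source_code = """<html>
<p>1</p>
<p>2</p>
<p>3</p>
</html>"""

soup = BeautifulSoup(source_code, "html.parser")

# find_all() returns a list-like ResultSet of Tag objects;
# get_text() on each Tag returns only the text inside it.
texts = [p.get_text() for p in soup.find_all("p")]
print(texts)
```

Collecting into a list first makes it easy to reuse the values later instead of only printing them.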
I think you can do what they do in this Stack Overflow question: use findAll(text=True). So in your code:
from bs4 import BeautifulSoup
"""<html>
<p>1</p>
<p>2</p>
<p>3</p>
"""
soup = BeautifulSoup(open('index.html'), "html.parser")
i = soup.findAll(text=True)
print(i)
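One thing to be aware of with this approach: matching every text node also returns the whitespace between tags. The sketch below assumes an inline HTML string and uses string=True, which is the newer spelling of text=True in recent bs4 releases, plus stripped_strings to drop the whitespace-only entries:

```python
from bs4 import BeautifulSoup

# Inline sample document (an assumption; the question reads index.html instead).
source_code = """<html>
<p>1</p>
<p>2</p>
<p>3</p>
</html>"""

soup = BeautifulSoup(source_code, "html.parser")

# string=True matches every text node, including whitespace-only ones
# such as the newlines between the <p> tags.
all_strings = soup.find_all(string=True)

# stripped_strings skips whitespace-only nodes and strips the rest.
clean = list(soup.stripped_strings)
print(clean)
```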
I have an html file:
...
<span class="value">401<span class="Suffix">st</span></span>
...
and I want to get only the first span tag's text, which is 401, but when I run:
>>> get_text = soup.find(class_ = 'value').text
>>> print(get_text)
401st
the output also contains the inner span's text (st).
This will work fine:
from bs4 import BeautifulSoup
s = '''<span class="value">401<span class="Suffix">st</span></span>'''
soup = BeautifulSoup(s, 'html.parser')
get_text = soup.find(class_='value')
print(get_text.contents[0])
Output
401
Use .strings. This is a property that gives a generator of the individual string elements. .text or .get_text() is what joins those strings together.
>>> soup = bs4.BeautifulSoup('<span class="value">401<span class="Suffix">st</span></span>', 'html.parser')
>>> t = soup.find(class_='value')
>>> next(t.strings)
'401'
>>> list(t.strings)
['401', 'st']
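You need to use recursive=False so that only the tag's own text is searched, not the text of nested tags:
from bs4 import BeautifulSoup
html = '<span class="value">401<span class="Suffix">st</span></span>'
soup = BeautifulSoup(html, 'html.parser')
# look only at the direct children of each matching span
for parent in soup.find_all('span', class_='value'):
    ptext = parent.find(text=True, recursive=False)
    print(ptext)  # 401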
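If a tag has several direct text nodes, the same recursive=False idea can be wrapped in a small helper. This is only a sketch: direct_text is a hypothetical name, not part of BeautifulSoup, and string=True is used as the newer spelling of text=True:

```python
from bs4 import BeautifulSoup


def direct_text(tag):
    """Return the text that belongs to `tag` itself, ignoring nested tags."""
    # recursive=False restricts the search to direct children of `tag`.
    return "".join(tag.find_all(string=True, recursive=False)).strip()


html = '<span class="value">401<span class="Suffix">st</span></span>'
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("span", class_="value")

own = direct_text(outer)        # text of the outer span only
everything = outer.get_text()   # text of the outer span and its descendants
print(own, everything)
```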
html = '''<p class="product-new-price">96<sup>33</sup> <span class="tether-target tether-enabled tether-element-attached-top tether-element-attached-left tether-target-attached-top tether-target-attached-right">Lei</span>
</p>'''
soup = BeautifulSoup(html, 'html.parser')
sup_elem = soup.find("sup").string # 33 - it works
How do I get the "96" before the <sup> element?
You can grab the text node just before the <sup> tag with previous_sibling:
from bs4 import BeautifulSoup
html = '''<p class="product-new-price">96<sup>33</sup> <span class="tether-target tether-enabled tether-element-attached-top tether-element-attached-left tether-target-attached-top tether-target-attached-right">Lei</span>
</p>'''
soup = BeautifulSoup(html, 'html.parser')
elem1 = soup.find("sup").previous_sibling  # the text node '96'
elem2 = soup.find("sup").text  # 33 - it works
print('.'.join([elem1, elem2]))
Output:
96.33
You can use the children property. It yields all the children of the p tag; the text 96 is the first child.
from bs4 import BeautifulSoup
html = '''<p class="product-new-price">96<sup>33</sup> <span class="tether-target tether-enabled tether-element-attached-top tether-element-attached-left tether-target-attached-top tether-target-attached-right">Lei</span>
</p>'''
soup = BeautifulSoup(html, 'html.parser')
elem = list(soup.find("p").children)[0]  # the first child is the text node '96'
sup_elem = soup.find("sup").string
result = elem + '.' + sup_elem  # 96.33
Use select instead.
from bs4 import BeautifulSoup
html = '''<p class="product-new-price">96<sup>33</sup> <span class="tether-target tether-enabled tether-element-attached-top tether-element-attached-left tether-target-attached-top tether-target-attached-right">Lei</span>
</p>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('.product-new-price').text.strip().replace('Lei',''))
There is no "." in the source, but you can always divide by 100:
print(int(soup.select_one('.product-new-price').text.strip().replace('Lei',''))/100)
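As an alternative to dividing by 100, the two parts of the price can be read separately and combined. The following is a rough sketch, assuming the whole part is always the first direct text node of the p tag and the fractional part is always in the sup tag; the class list on the span is shortened here, and using Decimal is my own choice, not something from the answers:

```python
from decimal import Decimal

from bs4 import BeautifulSoup

# Shortened version of the markup from the question (span classes trimmed).
html = '''<p class="product-new-price">96<sup>33</sup> <span class="tether-target">Lei</span>
</p>'''
soup = BeautifulSoup(html, "html.parser")
price_tag = soup.select_one(".product-new-price")

# First direct text node of <p>: the whole part of the price.
whole = price_tag.find(string=True, recursive=False).strip()
# Text of the <sup> tag: the fractional part.
cents = price_tag.find("sup").get_text(strip=True)

price = Decimal(f"{whole}.{cents}")
print(price)
```

Decimal avoids the rounding surprises that floats can introduce when working with money.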
I use BeautifulSoup to extract the value of a class attribute that holds multiple values, but the result ['fa', 'fa-address-book-o'] is not what I want.
from bs4 import BeautifulSoup
html = "<i class='fa fa-address-book-o' aria-hidden='true'></i>"
soup = BeautifulSoup(html, "lxml")
h2 = soup.select("i")
print(h2[0]['class'])
I want the output to look like this:
fa fa-address-book-o
Join all the elements in the list, with a space between them:
from bs4 import BeautifulSoup
html = "<i class='fa fa-address-book-o' aria-hidden='true'></i>"
soup = BeautifulSoup(html, "lxml")
h2 = soup.select("i")
print(' '.join(h2[0]['class']))
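The list appears because BeautifulSoup treats class as a multi-valued attribute. Below is a small sketch comparing the join approach with the multi_valued_attributes=None constructor option, which, as far as I know, is available in bs4 4.8 and later and keeps the attribute as one string. It uses html.parser instead of lxml only so that no extra package is needed:

```python
from bs4 import BeautifulSoup

html = "<i class='fa fa-address-book-o' aria-hidden='true'></i>"

# Default behaviour: class is parsed into a list of values.
soup = BeautifulSoup(html, "html.parser")
as_list = soup.i["class"]
joined = " ".join(as_list)

# With multi_valued_attributes=None the attribute stays a single string.
soup_raw = BeautifulSoup(html, "html.parser", multi_valued_attributes=None)
as_string = soup_raw.i["class"]

print(as_list, joined, as_string)
```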
I am trying to get the text of an element by its id with BeautifulSoup. The result should be 30,66.
My current code prints the complete span element:
[<span class="mainValueAmount simpleTextFit" id="ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldValue">30,66</span>]
How do I get just the value 30,66?
from bs4 import BeautifulSoup
u = '<div class="widgetBox" data-name="pvEnergy"><div class="widgetHead">PV-Energie</div><div class="widgetBody"><div class="mainValue"><span id="ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldValue" class="mainValueAmount simpleTextFit">30,66</span><span id="ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldUnit" class="mainValueUnit">kWh</span><br><span id="ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldPeriodTitle" class="mainValueDescription">Heute</span></div></div><div id="ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldTotalDiv" class="widgetFooter">Gesamt: <span id="ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldTotalValue">158,953</span><span id="ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldTotalUnit">MWh</span></div></div>'
idAktWert = 'ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldValue'
soup = BeautifulSoup(u, "html.parser")
aktWert = soup.select("#" + idAktWert)
print(aktWert)
Thanks for your help!
Use .text
Ex:
from bs4 import BeautifulSoup
u = '<div class="widgetBox" data-name="pvEnergy"><div class="widgetHead">PV-Energie</div><div class="widgetBody"><div class="mainValue"><span id="ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldValue" class="mainValueAmount simpleTextFit">30,66</span><span id="ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldUnit" class="mainValueUnit">kWh</span><br><span id="ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldPeriodTitle" class="mainValueDescription">Heute</span></div></div><div id="ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldTotalDiv" class="widgetFooter">Gesamt: <span id="ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldTotalValue">158,953</span><span id="ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldTotalUnit">MWh</span></div></div>'
idAktWert = 'ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldValue'
soup = BeautifulSoup(u, "html.parser")
aktWert = soup.select("#" + idAktWert)[0]  # Note: select() returns a list, so index it to get the first element.
print(aktWert.text)
Output:
30,66
You simply need get_text() for this.
from bs4 import BeautifulSoup
u = '<div class="widgetBox" data-name="pvEnergy"><div class="widgetHead">PV-Energie</div><div class="widgetBody"><div class="mainValue"><span id="ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldValue" class="mainValueAmount simpleTextFit">30,66</span><span id="ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldUnit" class="mainValueUnit">kWh</span><br><span id="ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldPeriodTitle" class="mainValueDescription">Heute</span></div></div><div id="ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldTotalDiv" class="widgetFooter">Gesamt: <span id="ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldTotalValue">158,953</span><span id="ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldTotalUnit">MWh</span></div></div>'
idAktWert = 'ctl00_ContentPlaceHolder1_PublicPagePlaceholder1_PageUserControl_ctl00_PublicPageLoadFixPage_energyYieldWidget_energyYieldValue'
soup = BeautifulSoup(u, "html.parser")
aktWert = soup.select("#" + idAktWert)
# select() returns a list, so take the first element
print(aktWert[0].get_text())  # outputs 30,66
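Since an id should match at most one element, select_one() can be used instead of indexing the list. This is a sketch with a shortened stand-in for the widget markup; the short id energyYieldValue is hypothetical, and the conversion to float is an extra step that none of the answers above include:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the widget markup; the id is hypothetical.
html = (
    '<div class="mainValue">'
    '<span id="energyYieldValue" class="mainValueAmount">30,66</span>'
    '<span class="mainValueUnit">kWh</span>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# select_one() returns a single Tag, or None if nothing matches.
tag = soup.select_one("#energyYieldValue")
value_text = tag.get_text(strip=True) if tag is not None else None

# Replace the decimal comma before converting to a number.
value = float(value_text.replace(",", ".")) if value_text else None
print(value_text, value)
```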
How do I extract the data inside <p> paragraph tags and <li> tags that are under a <div> with a given class?
Use the functions find() and find_all():
import requests
from bs4 import BeautifulSoup
url = '...'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
div = soup.find('div', {'class':'class-name'})
ps = div.find_all('p')
lis = div.find_all('li')
# print the content of all <p> tags
for p in ps:
    print(p.text)
# print the content of all <li> tags
for li in lis:
    print(li.text)
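The same steps can be tried without a network request. The sketch below assumes a made-up HTML string in place of the downloaded page and keeps the placeholder class name class-name from the answer; it also shows that find_all() accepts a list of tag names if both kinds of tags are wanted in document order:

```python
from bs4 import BeautifulSoup

# Made-up sample page (an assumption; the answer fetches a real URL instead).
html = """
<div class="class-name">
  <p>First paragraph</p>
  <ul><li>Item A</li><li>Item B</li></ul>
  <p>Second paragraph</p>
</div>
<div class="other"><p>Ignored</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns None when no such <div> exists; real code should check for that.
div = soup.find("div", class_="class-name")

paragraphs = [p.get_text(strip=True) for p in div.find_all("p")]
items = [li.get_text(strip=True) for li in div.find_all("li")]

# A list of names matches any of them, in document order.
in_order = [t.get_text(strip=True) for t in div.find_all(["p", "li"])]
print(paragraphs, items, in_order)
```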