Extract element from HTML - python

I have html:
<div class="img-holder">
<h1>Sample Image</h1>
<img src="http://sample.com/img.jpg"/>
</div>
With:
s = soup.find('div', {'class' : 'img-holder'}).h1
s = s.get_text()
This displays 'Sample Image'.
How do I get the image src using the same format?

Use img.attrs["src"]
Ex:
from bs4 import BeautifulSoup
s = """<div class="img-holder">
<h1>Sample Image</h1>
<img src="http://sample.com/img.jpg"/>
</div>"""
soup = BeautifulSoup(s, "html.parser")
s = soup.find('div', {'class' : 'img-holder'})
print( s.img.attrs["src"] )

Like this?
soup.find('img')['src']
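Both answers assume the div and the img are present. As a minimal sketch of a more defensive variant (the variable names `holder`, `img` and `src` are only illustrative), you can guard against a missing element and use `Tag.get()`, which returns None instead of raising KeyError when the attribute is absent:

```python
from bs4 import BeautifulSoup

html = """<div class="img-holder">
<h1>Sample Image</h1>
<img src="http://sample.com/img.jpg"/>
</div>"""

soup = BeautifulSoup(html, "html.parser")
holder = soup.find("div", {"class": "img-holder"})

# find() returns None when nothing matches, so check before drilling down.
img = holder.img if holder is not None else None

# Tag.get() returns None (no KeyError) if the attribute does not exist.
src = img.get("src") if img is not None else None
print(src)
```

This keeps the same `soup.find(...)` format as the question and only adds the None checks.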

Related

How to get the value inside a div with a given class using BeautifulSoup

I'm trying to extract the "24,8" from the following HTML code:
<div class="anlik-sicaklik">
<div class="anlik-sicaklik-deger ng-binding" ng-bind="sondurum[0].sicaklik | comma">
24,8
::after
</div>
<div class="anlik-sicaklik-havadurumu">
<div class="anlik-sicaklik-havadurumu-ikonu">
Here's my code
import requests
from bs4 import BeautifulSoup
r = requests.get("https://mgm.gov.tr/tahmin/il-ve-ilceler.aspx?il=ANTALYA&ilce=KUMLUCA")
soup = BeautifulSoup(r.content, "lxml")
sicaklik = soup.find('div', {'class':'anlik-sicaklik-deger'})
print(sicaklik)
My code's output
<div class="anlik-sicaklik-deger" ng-bind="sondurum[0].sicaklik | comma">
</div>
Could you please help me get the 24,8 value?
Your question is more about parsing a string than parsing a web page. Once you have found the tag with bs4, it is easier to parse its text with a regex.
The pattern ([0-9]+,[0-9]) matches one or more digits, followed by a comma, followed by a single digit.
Notice that the final result, nr, is a string; to turn it into a number, use float(nr.replace(',', '.')).
from bs4 import BeautifulSoup
import re
html = """
<div class="anlik-sicaklik-deger ng-binding" ng-bind="sondurum[0].sicaklik | comma">
24,8
::after
</div>
"""
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', class_='anlik-sicaklik-deger', string=True)
# get text
text = str(div.string).strip()
# regex
nr = re.search(r'([0-9]+,[0-9])', text).group(0)
print(nr)
Output
24,8
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
sicaklik = soup.find('div', {'class':'anlik-sicaklik-deger'}).text
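Putting the two answers together, here is a rough sketch that goes from the tag to a float. It assumes the value is actually present in the HTML handed to BeautifulSoup, as in the snippet from the question; the names `text`, `match` and `temperature` are illustrative only.

```python
import re
from bs4 import BeautifulSoup

html = """
<div class="anlik-sicaklik-deger ng-binding" ng-bind="sondurum[0].sicaklik | comma">
24,8
::after
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches a tag that has this class among its classes.
div = soup.find("div", class_="anlik-sicaklik-deger")

# get_text(strip=True) strips the surrounding whitespace of the text node.
text = div.get_text(strip=True) if div is not None else ""

# Digits, a comma, then digits; swap the comma for a dot before float().
match = re.search(r"[0-9]+,[0-9]+", text)
temperature = float(match.group(0).replace(",", ".")) if match else None
print(temperature)
```

If `temperature` comes back as None on the live page, the div was probably empty in the downloaded HTML, which is what the output in the question shows.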

Trouble getting <a> tag using BeautifulSoup

I need to get the href attribute from an <a> tag, but it doesn't work:
soup = BeautifulSoup(html, 'html.parser')
a_tags = soup.find_all('a')
print(a_tags[0].p) #print <p> tag
print(a_tags[0].a) #print 'None'
print(a_tags[0].a.get('href')) #doesn't work
but if I try to print(a_tags) it shows them:
[<a href="/org/colleges/instrcol/Pages/pic1.aspx" style="display:block;" target="_blank">
<div style="min-height:360px;">
<img alt="pic1" src="iblock/6ba/%d0%90%d0%b1%d1%80%d0%b0%d0%bc%d0%be%d0%b2%20%d0%a1%d0%b5%d1%80%d0%b3%d0%b5%d0%b9%20%d0%90%d0%bd%d1%82%d0%be%d0%bd%d0%b8%d0%b4%d0%be%d0%b2%d0%b8%d1%87.jpg"/>
<p>Pic1</p></div>
</a>, <a href="/org/colleges/instrcol/Pages/pic2.aspx" style="display:block;" target="_blank">
<div style="min-height:360px;">
<img alt="pic2" src="iblock/1ee/%d0%90%d0%b3%d0%b0%d1%84%d0%be%d0%bd%d0%be%d0%b2%20%d0%9f%d0%b0%d0%b2%d0%b5%d0%bb%20%d0%92%d0%b8%d1%82%d0%b0%d0%bb%d1%8c%d0%b5%d0%b2%d0%b8%d1%87.jpg"/>
<p>Pic2</p></div>
</a>,
...
What is causing this problem?
Read href directly from each <a> tag. You can also pass href=True to find_all() so that only <a> tags that have an href attribute are returned.
Try this:
soup = BeautifulSoup(html, 'html.parser')
a_tags = soup.find_all('a', href=True)
for a_tag in a_tags:
    print(a_tag['href'])
Each element of a_tags is already an <a> tag, so a_tags[0].a looks for another <a> nested inside it and returns None.
Replace a_tags[0].a.get('href') with a_tags[0].get('href').
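To see both points side by side, here is a small self-contained sketch. The HTML is a shortened stand-in for the page in the question: the img src values and the third link without href are placeholders, not taken from the real site.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's HTML (src values are placeholders).
html = """
<a href="/org/colleges/instrcol/Pages/pic1.aspx" style="display:block;" target="_blank">
<div style="min-height:360px;"><img alt="pic1" src="pic1.jpg"/><p>Pic1</p></div>
</a>
<a href="/org/colleges/instrcol/Pages/pic2.aspx" style="display:block;" target="_blank">
<div style="min-height:360px;"><img alt="pic2" src="pic2.jpg"/><p>Pic2</p></div>
</a>
<a name="no-href">not a link</a>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the <a> tags that carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)

# .a on an <a> tag searches its descendants, and there is no nested <a>.
nested = soup.find_all("a")[0].a
print(nested)
```

The second print shows None, which is why calling .get('href') on it fails in the question.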

Extract content of div tag except other tags using BeautifulSoup

I have the HTML content below, where the div tag looks like this:
<div class="block">aaa
<p> bbb</p>
<p> ccc</p>
</div>
From the above, I want to extract only the text "aaa", not the content of the other tags.
When I do
soup.find('div', {"class": "block"})
it gives me all of the content as text, and I want to avoid the contents of the p tags.
Is there a method available in BeautifulSoup to do this?
Check the type of each child element. You could try:
from bs4 import BeautifulSoup
from bs4 import element
s = '''
<div class="block">aaa
<p> bbb</p>
<p> ccc</p>
<h1>ddd</h1>
</div>
'''
soup = BeautifulSoup(s, "lxml")
for e in soup.find('div', {"class": "block"}):
    if type(e) == element.NavigableString and e.strip():
        print(e.strip())
# aaa
This ignores all text inside child tags.
You can remove the p tags from that div, which effectively gives you the aaa text.
Here's how:
from bs4 import BeautifulSoup
sample = """<div class="block">aaa
<p> bbb</p>
<p> ccc</p>
</div>
"""
s = BeautifulSoup(sample, "html.parser")
excluded = [i.extract() for i in s.find("div", class_="block").find_all("p")]
print(s.text.strip())
Output:
aaa
You can use find_next(), which returns the first match found:
from bs4 import BeautifulSoup
html = '''
<div class="block">aaa
<p> bbb</p>
<p> ccc</p>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
print(soup.find('div', {"class": "block"}).find_next(text=True))
Output:
aaa
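Another option, sketched here as an alternative to the answers above rather than a replacement, is to ask only for the div's direct text nodes with `find_all(string=True, recursive=False)`. This leaves the tree unmodified and also picks up text that appears between or after the child tags. The names `block`, `direct_strings` and `own_text` are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="block">aaa
<p> bbb</p>
<p> ccc</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
block = soup.find("div", {"class": "block"})

# recursive=False limits the search to direct children of the div;
# string=True keeps only text nodes, so text inside <p> is skipped.
direct_strings = block.find_all(string=True, recursive=False)

# Drop whitespace-only nodes and join what is left.
own_text = " ".join(s.strip() for s in direct_strings if s.strip())
print(own_text)
```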

BS4 Python get a href url

I'm stuck on a bs4 script. I need to get the image link from either the meta content or the img src. How can I do that? Basically I need to get this:
<meta itemprop="image" content="https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950">
or
<img src="https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950" alt="Posted by Publica Group " width="120" height="50" class=" b-loaded" style="display: inline;">
I tried do that with :
logoscrap = soup.find('meta', attrs={'itemprop': 'image'})
and
logoscrap = soup.find('img', class_="b-loaded").attrs['src']
But my code doesn't work...
soup.find returns a Tag object (or None if nothing matches). A Tag supports dictionary-style access, so you can read an attribute directly:
img = soup.find('meta', attrs={'itemprop': 'image'})
logoscrap = img['content']
#output:
https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950
or
img = soup.find('img', class_="b-loaded")
logoscrap = img['src']
#output:
https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950
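Since find() returns None when the element is not in the downloaded HTML (which can happen if it is added by JavaScript after the page loads), a hedged sketch with a fallback may help. The helper name `logo_url` is hypothetical and not part of bs4; it tries the meta tag first and then the img tag.

```python
from bs4 import BeautifulSoup

html = """
<meta itemprop="image" content="https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950">
<img src="https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950" alt="Posted by Publica Group " width="120" height="50" class=" b-loaded" style="display: inline;">
"""


def logo_url(soup):
    """Hypothetical helper: return the logo URL, or None if neither tag exists."""
    meta = soup.find("meta", attrs={"itemprop": "image"})
    if meta is not None and meta.get("content"):
        return meta["content"]
    # Fall back to the <img> tag when the <meta> tag is missing.
    img = soup.find("img", class_="b-loaded")
    if img is not None:
        return img.get("src")
    return None


soup = BeautifulSoup(html, "html.parser")
logoscrap = logo_url(soup)
print(logoscrap)
```

If this prints None for the real page, print the raw response text and check whether either tag is in it at all.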

Is there an InnerText equivalent in BeautifulSoup?

With the code below:
soup = BeautifulSoup(page.read(), fromEncoding="utf-8")
result = soup.find('div', {'class' :'flagPageTitle'})
I get the following html:
<div id="ctl00_ContentPlaceHolder1_Item65404" class="flagPageTitle" style=" ">
<span></span><p>Some text here</p>
</div>
How can I get "Some text here" without any tags? Is there an InnerText equivalent in BeautifulSoup?
All you need is:
result = soup.find('div', {'class' :'flagPageTitle'}).text
You can use findAll(text=True) to only find text nodes.
result = u''.join(result.findAll(text=True))
You can search for <p> and get its text:
soup = BeautifulSoup.BeautifulSoup(page.read(), fromEncoding="utf-8")
result = soup.find('div', {'class': 'flagPageTitle'})
result = result.find('p').text
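The snippets above use the older BeautifulSoup 3 style. As a rough bs4 equivalent, assuming the same HTML as in the question, `.text` returns the text with its surrounding newlines, while `get_text(separator=" ", strip=True)` strips each piece and joins them:

```python
from bs4 import BeautifulSoup

html = """<div id="ctl00_ContentPlaceHolder1_Item65404" class="flagPageTitle" style=" ">
<span></span><p>Some text here</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
result = soup.find("div", {"class": "flagPageTitle"})

# .text keeps the whitespace that surrounds the child tags.
raw = result.text
print(repr(raw))

# get_text() with strip=True drops that whitespace; separator joins the pieces.
clean = result.get_text(separator=" ", strip=True)
print(clean)
```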
