Trouble getting <a> tag using BeautifulSoup - python

I need to get a href attribute from <а> tag, but it doesn't work
soup = BeautifulSoup(html, 'html.parser')
a_tags = soup.find_all('a')
print(a_tags[0].p) #print <p> tag
print(a_tags[0].a) #print 'None'
print(a_tags[0].a.get('href')) #doesn't work
but if I try to print(a_tags) it shows them:
[<a href="/org/colleges/instrcol/Pages/pic1.aspx" style="display:block;" target="_blank">
<div style="min-height:360px;">
<img alt="pic1" src="iblock/6ba/%d0%90%d0%b1%d1%80%d0%b0%d0%bc%d0%be%d0%b2%20%d0%a1%d0%b5%d1%80%d0%b3%d0%b5%d0%b9%20%d0%90%d0%bd%d1%82%d0%be%d0%bd%d0%b8%d0%b4%d0%be%d0%b2%d0%b8%d1%87.jpg"/>
<p>Pic1</p></div>
</a>, <a href="/org/colleges/instrcol/Pages/pic2.aspx" style="display:block;" target="_blank">
<div style="min-height:360px;">
<img alt="pic2" src="iblock/1ee/%d0%90%d0%b3%d0%b0%d1%84%d0%be%d0%bd%d0%be%d0%b2%20%d0%9f%d0%b0%d0%b2%d0%b5%d0%bb%20%d0%92%d0%b8%d1%82%d0%b0%d0%bb%d1%8c%d0%b5%d0%b2%d0%b8%d1%87.jpg"/>
<p>Pic2</p></div>
</a>,
...
What is causing this problem?

You forgot to add href=True while using find_all()
Try this:
soup = BeautifulSoup(html, 'html.parser')
a_tags = soup.find_all('a', href=True)
for a_tag in a_tags:
print(a_tag['href'])

a_tags contains <a> already.
Replace a_tags[0].a.get('href') with a_tags[0].get('href').

Related

How to just get the content of the tag when you use findAll in Beautiful Soup?

So on the project I'm building I want to find the price contained on the multiple results I got with the findAll() command. Here's the code:
soup = BeautifulSoup(driver.page_source, 'html.parser')
price = soup.find_all(class_='search-result__market-price--value')
print(price)
And this is what I get:
[<span class="search-result__market-price--value" tabindex="-1"> $0.11 </span>, <span class="search-result__market-price--value" tabindex="-1"> $0.24 </span>, ... ]
I tried using this code I found somewhere else price = soup.find_all(class_='search-result__market-price--value')[0].string but it just gave the error IndexError: list index out of range.
What can I do to just get the numbers?
Iterate the ResultSet created by find_all():
soup = BeautifulSoup(driver.page_source, 'html.parser')
for price in soup.find_all(class_='search-result__market-price--value'):
print(price.text)
or to just get the numbers
print(price.text.split('$')[-1])
Example
from bs4 import BeautifulSoup
html='''
<span class="search-result__market-price--value" tabindex="-1"> $0.11 </span>
<span class="search-result__market-price--value" tabindex="-1"> $0.24 </span>
'''
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all('span'):
print(tag.text.split('$')[-1])
Output
0.11
0.24

How can I get all text inside paragraph tags that's inside a div element

So I'm trying to scrape a news website and get the actual text inside it. My problem right now is that the actual article is divided into several p tags who in turn are inside a div tag.
It looks like this:
<div>
<p><strong>S</strong>paragraph</p>
<p>paragraph</p>
<h2 class="more-on__heading">heading</h2>
<figure>fig</figure>
<h2>header/h2>
<p>text</p>
<p>text</p>
<p>text</p>
</div>
What I tried so far is this:
article = requests.get(url)
soup = BeautifulSoup(article.content, 'html.parser')
article_title = soup.find('h1').text
article_author = soup.find('a', class_='author-link').text
article_text = ''
for element in soup.find('div', class_='wysiwyg wysiwyg--all-content css-1vkfgk0'):
article_text += element.find('p').text
But it shows that 'NoneType' object has no attribute 'text'
Cause expected output is not that clear from the question - General approch would be to select all p in your div e.g. with css selectors extract the text and join() it by what ever you like:
article_text = '\n'.join(e.text for e in soup.select('div p'))
If you just like to scrape text from siblings of the h2 in your example use:
article_text = '\n'.join(e.text for e in soup.select('h2 ~ p'))
or with find() and find_next_siblings():
article_text = '\n'.join(e.text for e in soup.find('h2').find_next_siblings('p'))
Example
from bs4 import BeautifulSoup
html='''
<div>
<p><strong>S</strong>paragraph</p>
<p>paragraph</p>
<h2 class="more-on__heading">heading</h2>
<figure>fig</figure>
<h2>header/h2>
<p>text</p>
<p>text</p>
<p>text</p>
</div>'''
soup = BeautifulSoup(html)
article_text = '\n'.join(e.text for e in soup.select('h2 ~ p'))
print(article_text)
Output
text
text
text

How to get child value of div seperately using beautifulsoup

I have a div tag which contains three anchor tags and have url in them.
I am able to print those 3 hrefs but they get merged into one value.
Is there a way I can get three seperate values.
Div looks like this:
<div class="speaker_social_wrap">
<a href="https://twitter.com/Sigve_telenor" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-twitter" data-x-icon-b=""></i>
</a>
<a href="https://no.linkedin.com/in/sigvebrekke" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-linkedin-in" data-x-icon-b=""></i>
</a>
<a href="https://www.facebook.com/sigve.telenor" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-facebook-f" data-x-icon-b=""></i>
</a>
What I have tried so far:
social_media_url = soup.find_all('div', class_ = 'foo')
for url in social_media_url:
print(url)
Expected Result:
http://twitter-url
http://linkedin-url
http://facebook-url
My Output
<div><a twitter-url><a linkedin-url><a facebook-url></div>
You can do like this:
from bs4 import BeautifulSoup
import requests
url = 'https://dtw.tmforum.org/speakers/sigve-brekke-2/'
r = requests.get(url)
soup = BeautifulSoup(r.text,'lxml')
a = soup.find('div', class_='speaker_social_wrap').find_all('a')
for i in a:
print(i['href'])
https://twitter.com/Sigve_telenor
https://no.linkedin.com/in/sigvebrekke
https://www.facebook.com/sigve.telenor
Your selector gives you the div not the urls array. You need something more like:
social_media_div = soup.find_all('div', class_ = 'foo')
social_media_anchors = social_media_div.find_all('a')
for anchor in social_media_anchors:
print(anchor.get('href'))

Web scraping with Beautiful Soup gives empty ResultSet

I am experimenting with Beautiful Soup and I am trying to extract information from a HTML document that contains segments of the following type:
<div class="entity-body">
<h3 class="entity-name with-profile">
<a href="https://www.linkedin.com/profile/view?id=AA4AAAAC9qXUBMuA3-txf-cKOPsYZZ0TbWJkhgfxfpY&trk=manage_invitations_profile"
data-li-url="/profile/mini-profile-with-connections?_ed=0_3fIDL9gCh6b5R-c9s4-e_B&trk=manage_invitations_miniprofile"
class="miniprofile"
aria-label="View profile for Ivan Grigorov">
<span>Ivan Grigorov</span>
</a>
</h3>
<p class="entity-subheader">
Teacher
</p>
</div>
I have used the following commands:
with open("C:\Users\pv\MyFiles\HTML\Invites.html","r") as Invites: soup = bs(Invites, 'lxml')
soup.title
out: <title>Sent Invites\n| LinkedIn\n</title>
invites = soup.find_all("div", class_ = "entity-body")
type(invites)
out: bs4.element.ResultSet
len(invites)
out: 0
Why find_all returns empty ResultSet object?
Your advice will be appreciated.
The problem is that the document is not read, it is a just TextIOWrapper (Python 3) or File(Python 2) object. You have to read the documet and pass markup, essentily a string to BeautifulSoup.
The correct code would be:
with open("C:\Users\pv\MyFiles\HTML\Invites.html", "r") as Invites:
soup = BeautifulSoup(Invites.read(), "html.parser")
soup.title
invites = soup.find_all("div", class_="entity-body")
len(invites)
import bs4
html = '''<div class="entity-body">
<h3 class="entity-name with-profile">
<a href="https://www.linkedin.com/profile/view?id=AA4AAAAC9qXUBMuA3-txf-cKOPsYZZ0TbWJkhgfxfpY&trk=manage_invitations_profile"
data-li-url="/profile/mini-profile-with-connections?_ed=0_3fIDL9gCh6b5R-c9s4-e_B&trk=manage_invitations_miniprofile"
class="miniprofile"
aria-label="View profile for Ivan Grigorov">
<span>Ivan Grigorov</span>
</a>
</h3>
<p class="entity-subheader">
Teacher
</p>
</div>'''
soup = bs4.BeautifulSoup(html, 'lxml')
invites = soup.find_all("div", class_ = "entity-body")
len(invites)
out:
1
this code works fine

Is there an InnerText equivalent in BeautifulSoup?

With the code below:
soup = BeautifulSoup(page.read(), fromEncoding="utf-8")
result = soup.find('div', {'class' :'flagPageTitle'})
I get the following html:
<div id="ctl00_ContentPlaceHolder1_Item65404" class="flagPageTitle" style=" ">
<span></span><p>Some text here</p>
</div>
How can I get Some text here without any tags? Is there InnerText equivalent in BeautifulSoup?
All you need is:
result = soup.find('div', {'class' :'flagPageTitle'}).text
You can use findAll(text=True) to only find text nodes.
result = u''.join(result.findAll(text=True))
You can search for <p> and get its text:
soup = BeautifulSoup.BeautifulSoup(page.read(), fromEncoding="utf-8")
result = soup.find('div', {'class': 'flagPageTitle'})
result = result.find('p').text

Categories