Is there an InnerText equivalent in BeautifulSoup? - python

With the code below:
soup = BeautifulSoup(page.read(), fromEncoding="utf-8")
result = soup.find('div', {'class' :'flagPageTitle'})
I get the following html:
<div id="ctl00_ContentPlaceHolder1_Item65404" class="flagPageTitle" style=" ">
<span></span><p>Some text here</p>
</div>
How can I get "Some text here" without any tags? Is there an InnerText equivalent in BeautifulSoup?

All you need is:
result = soup.find('div', {'class' :'flagPageTitle'}).text

You can use findAll(text=True) to only find text nodes.
result = u''.join(result.findAll(text=True))

You can search for <p> and get its text:
soup = BeautifulSoup.BeautifulSoup(page.read(), fromEncoding="utf-8")
result = soup.find('div', {'class': 'flagPageTitle'})
result = result.find('p').text
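For readers on the newer bs4 package (an assumption; the snippets above use the older BeautifulSoup 3 import style), a minimal sketch of the same idea might look like this. The HTML is the fragment from the question, and get_text(strip=True) is used here only to drop the surrounding newlines that .text keeps.

```python
from bs4 import BeautifulSoup

# HTML fragment taken from the question.
html = """<div id="ctl00_ContentPlaceHolder1_Item65404" class="flagPageTitle" style=" ">
<span></span><p>Some text here</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="flagPageTitle")

# .text concatenates every text node, including the newlines between tags.
raw_text = div.text

# get_text(strip=True) strips each text node and skips the empty ones.
clean_text = div.get_text(strip=True)
print(repr(clean_text))
```

If the text of several nested tags should stay separated, get_text() also accepts a separator argument, for example get_text(" ", strip=True).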

Related

How can I get all text inside paragraph tags that's inside a div element

So I'm trying to scrape a news website and get the actual text inside it. My problem right now is that the actual article is divided into several p tags, which in turn are inside a div tag.
It looks like this:
<div>
<p><strong>S</strong>paragraph</p>
<p>paragraph</p>
<h2 class="more-on__heading">heading</h2>
<figure>fig</figure>
<h2>header</h2>
<p>text</p>
<p>text</p>
<p>text</p>
</div>
What I tried so far is this:
article = requests.get(url)
soup = BeautifulSoup(article.content, 'html.parser')
article_title = soup.find('h1').text
article_author = soup.find('a', class_='author-link').text
article_text = ''
for element in soup.find('div', class_='wysiwyg wysiwyg--all-content css-1vkfgk0'):
    article_text += element.find('p').text
But it raises: 'NoneType' object has no attribute 'text'
Since the expected output is not clear from the question, a general approach is to select all p elements in your div (e.g. with CSS selectors), extract their text, and join() it with whatever separator you like:
article_text = '\n'.join(e.text for e in soup.select('div p'))
If you only want to scrape the text of the p siblings that follow the h2 in your example, use:
article_text = '\n'.join(e.text for e in soup.select('h2 ~ p'))
or with find() and find_next_siblings():
article_text = '\n'.join(e.text for e in soup.find('h2').find_next_siblings('p'))
Example
from bs4 import BeautifulSoup
html='''
<div>
<p><strong>S</strong>paragraph</p>
<p>paragraph</p>
<h2 class="more-on__heading">heading</h2>
<figure>fig</figure>
<h2>header</h2>
<p>text</p>
<p>text</p>
<p>text</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
article_text = '\n'.join(e.text for e in soup.select('h2 ~ p'))
print(article_text)
Output
text
text
text
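The error in the question most likely comes from calling element.find('p') on each child of the div: a p child has no nested p, so find() returns None. A minimal sketch that avoids this by calling find_all('p') on the container itself is shown below. The class name article-body is hypothetical and stands in for the real class on the page.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question; "article-body" is a made-up class name.
html = """<div class="article-body">
<p><strong>S</strong>paragraph</p>
<p>paragraph</p>
<h2 class="more-on__heading">heading</h2>
<figure>fig</figure>
<h2>header</h2>
<p>text</p>
<p>text</p>
<p>text</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="article-body")

# Guard against a missing container instead of failing on None.
paragraphs = container.find_all("p") if container is not None else []

# Join the text of every <p> inside the container, one per line.
article_text = "\n".join(p.get_text() for p in paragraphs)
print(article_text)
```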

Trouble getting <a> tag using BeautifulSoup

I need to get the href attribute from an <a> tag, but it doesn't work:
soup = BeautifulSoup(html, 'html.parser')
a_tags = soup.find_all('a')
print(a_tags[0].p) #print <p> tag
print(a_tags[0].a) #print 'None'
print(a_tags[0].a.get('href')) #doesn't work
but if I try to print(a_tags) it shows them:
[<a href="/org/colleges/instrcol/Pages/pic1.aspx" style="display:block;" target="_blank">
<div style="min-height:360px;">
<img alt="pic1" src="iblock/6ba/%d0%90%d0%b1%d1%80%d0%b0%d0%bc%d0%be%d0%b2%20%d0%a1%d0%b5%d1%80%d0%b3%d0%b5%d0%b9%20%d0%90%d0%bd%d1%82%d0%be%d0%bd%d0%b8%d0%b4%d0%be%d0%b2%d0%b8%d1%87.jpg"/>
<p>Pic1</p></div>
</a>, <a href="/org/colleges/instrcol/Pages/pic2.aspx" style="display:block;" target="_blank">
<div style="min-height:360px;">
<img alt="pic2" src="iblock/1ee/%d0%90%d0%b3%d0%b0%d1%84%d0%be%d0%bd%d0%be%d0%b2%20%d0%9f%d0%b0%d0%b2%d0%b5%d0%bb%20%d0%92%d0%b8%d1%82%d0%b0%d0%bb%d1%8c%d0%b5%d0%b2%d0%b8%d1%87.jpg"/>
<p>Pic2</p></div>
</a>,
...
What is causing this problem?
Pass href=True to find_all() so that only <a> tags that have an href attribute are returned, then read the attribute from each tag directly:
soup = BeautifulSoup(html, 'html.parser')
a_tags = soup.find_all('a', href=True)
for a_tag in a_tags:
    print(a_tag['href'])
Each element of a_tags is already an <a> tag, so a_tags[0].a looks for another <a> nested inside it and returns None.
Replace a_tags[0].a.get('href') with a_tags[0].get('href').
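Putting both answers together, a small self-contained sketch could look like this. The markup is a shortened, made-up variant of the HTML in the question, with one extra <a> that has no href to show what href=True filters out.

```python
from bs4 import BeautifulSoup

# Shortened sample markup; the second <a> deliberately has no href.
html = """
<a href="/org/colleges/instrcol/Pages/pic1.aspx" target="_blank">
<div><img alt="pic1" src="pic1.jpg"/><p>Pic1</p></div>
</a>
<a name="no-href">anchor without href</a>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
for a_tag in soup.find_all("a", href=True):
    # The <p> is a descendant of the <a>, so search from the tag itself.
    caption = a_tag.find("p")
    caption_text = caption.get_text(strip=True) if caption is not None else None
    links.append((a_tag["href"], caption_text))

print(links)
```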

Extract element from HTML

I have html:
<div class="img-holder">
<h1>Sample Image</h1>
<img src="http://sample.com/img.jpg"/>
</div>
With:
s = soup.find('div', {'class' : 'img-holder'}).h1
s = s.get_text()
This displays 'Sample Image'.
How do I get the image src using the same format?
Use img.attrs["src"]
Example:
from bs4 import BeautifulSoup
s = """<div class="img-holder">
<h1>Sample Image</h1>
<img src="http://sample.com/img.jpg"/>
</div>"""
soup = BeautifulSoup(s, "html.parser")
s = soup.find('div', {'class' : 'img-holder'})
print( s.img.attrs["src"] )
Like this?
soup.find('img')['src']
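If the div or the img might be missing, a slightly more defensive sketch is to check each lookup and use .get(), which returns None for an absent attribute instead of raising KeyError. The variable names here are only illustrative.

```python
from bs4 import BeautifulSoup

html = """<div class="img-holder">
<h1>Sample Image</h1>
<img src="http://sample.com/img.jpg"/>
</div>"""

soup = BeautifulSoup(html, "html.parser")
holder = soup.find("div", class_="img-holder")

# Each step tolerates a missing element rather than failing on None.
img = holder.find("img") if holder is not None else None
src = img.get("src") if img is not None else None

# .get() on an attribute that does not exist returns None.
alt = img.get("alt") if img is not None else None
print(src, alt)
```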

Parsing <ul> tag using beautiful soup

Consider this code:
divTag = soup.find_all("div", {"class":"classname"})
print(divTag)
for tag in divTag:
    ulTag = soup.find_all("ul", {"class":"classname"})
    print(ulTag)
    for tag in ulTag:
        liTag = soup.find_all("li", {"class":"classname"})
        print(liTag)
        for tag in liTag:
            diTag = soup.find_all("div", {"class":"classname"})
            print(diTag)
            for tag in diTag:
                aTags = tag.find_next("a")
                value = aTags.string
                print(value)
It prints only "divTag" & "ulTag". I'm sure all the class names are right. There are about 7 'li' tags within the 'ul' tag but it does not print any of the 'li' tags. Please help. Thanks in advance.
UPDATE:
<div class="classname">
<ul auto-load="true" class="classname" data-href="">
<li class="classname">
<div class="classname">"value" string string1 <a class="muted"><abbr class="timeago" title=" 1 Jun, 2015, 10:23 am">7 hours ago</abbr></a>
</div>
</li>
<li>
</li>
</ul>
</div>
I basically want to extract the "string" value within the 'a' tag.
A full solution using next_sibling:
ulTag = soup.find("ul", {"class": "classname"})
aTags = ulTag.find_all("a")
for aTag in aTags:
    sibling = aTag.next_sibling
    siblingString = str(sibling).strip()
    if len(siblingString) > 0:
        print(siblingString)
Every call here searches the whole soup again instead of the tag found in the previous step, which is why it fails. You should search for each tag inside its parent tag.
Try something like this:
divTag = soup.find_all("div", {"class":"classname"})
for ulTag in divTag:
    for liTag in ulTag.find_all("li", {"class":"classname"}):
        for tag in liTag.find_all("div", {"class":"classname"}):
            for aTag in tag.find_all('a'):
                print(aTag.string)
For the HTML you provided, the output is:
"value"
string1
7 hours ago
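As a runnable illustration of searching inside the parent tag, here is a minimal sketch against the HTML from the update. It assumes bs4 and Python 3, and it collects only the text inside each <a>. The list name link_texts is made up for the example.

```python
from bs4 import BeautifulSoup

# HTML from the question's update.
html = """<div class="classname">
<ul auto-load="true" class="classname" data-href="">
<li class="classname">
<div class="classname">"value" string string1 <a class="muted"><abbr class="timeago" title=" 1 Jun, 2015, 10:23 am">7 hours ago</abbr></a>
</div>
</li>
<li>
</li>
</ul>
</div>"""

soup = BeautifulSoup(html, "html.parser")

link_texts = []
ul_tag = soup.find("ul", class_="classname")
if ul_tag is not None:
    # Search each level from the tag found one level up, not from soup.
    for li_tag in ul_tag.find_all("li", class_="classname"):
        for a_tag in li_tag.find_all("a"):
            link_texts.append(a_tag.get_text(strip=True))

print(link_texts)
```

The second li has no class attribute, so the class filter skips it.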
