Python: Beautiful Soup extracting info

I am using beautiful soup to parse HTML as follows:
html_content2 ="""
<h3 style="cear: both;">
<abbr title="European Union">EU</abbr>Investment</h3>
<div class="conditions">
<p>bla bla bla
</p>
</div>
<p style="margin-bottom: 0;">
<span class="amount">66000 €</span>
</p>"""
I would like to extract the amount of money and the code I have is:
from bs4 import BeautifulSoup
html_content = html_content2
soup = BeautifulSoup(html_content, "lxml")
t3 = soup.find(lambda tag:tag.name=="h3" and ": Investment").find_next_sibling().find_next_sibling("p").find("span").contents
print(t3)
The intention here is the following:
get h3 tag WITH text Investment and from there get next sibling and another next sibling with tag p then span and get the contents
In the code above I don't know how to include the word "Investment" in the lambda function.
I tried:
tag.name=="h3" and tag.contents==": Investment"
this does not work.

Use lambda tag: tag.name == "h3" and "Investment" in tag.text:
from bs4 import BeautifulSoup
html_content = """
<h3 style="cear: both;">
<abbr title="European Union">EU</abbr>Investment</h3>
<div class="conditions">
<p>bla bla bla
</p>
</div>
<p style="margin-bottom: 0;">
<span class="amount">66000 €</span>
</p>"""
soup = BeautifulSoup(html_content, "lxml")
amount = (
    soup.find(lambda tag: tag.name == "h3" and "Investment" in tag.text)
    .find_next("span", class_="amount")
    .text
)
print(amount)
Prints:
66000 €
You can use CSS selectors too:
amount = soup.select_one("h3:-soup-contains('Investment') ~ * .amount").text
print(amount)
Prints:
66000 €

You can get it with just one statement (select) using CSS selectors :
# t3selector = ....
t3 = soup.select_one(t3selector).get_text(strip=True)
where
t3selector = 'h3:-soup-contains("Investment") ~ * .amount'
if you just want the text from an element with a class amount inside a sibling of a h3 containing "Investment", but if you want exactly
get h3 tag WITH text Investment and from there get next sibling and another next sibling with tag p then span and get the contents
you'll need a slightly more specific selector
t3selector = 'h3:-soup-contains("Investment") + * + p span'
( if the span must be a child [direct descendant] of p, then end with p > span )
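To make the difference between the two selectors concrete, here is a minimal sketch. It assumes a Soup Sieve version that supports `:-soup-contains`; the second HTML snippet (with an extra `<hr>`) and the `pick` helper are made up for illustration and are not part of the question.

```python
from bs4 import BeautifulSoup

GENERAL = 'h3:-soup-contains("Investment") ~ * .amount'
STRICT = 'h3:-soup-contains("Investment") + * + p span'

# Same layout as in the question: h3, then a div, then the p with the span.
html_a = (
    '<h3>EU Investment</h3>'
    '<div class="conditions"><p>bla bla bla</p></div>'
    '<p><span class="amount">66000 €</span></p>'
)

# Hypothetical variant: an extra <hr> sits between the div and the p.
html_b = (
    '<h3>EU Investment</h3>'
    '<div class="conditions"><p>bla bla bla</p></div>'
    '<hr>'
    '<p><span class="amount">66000 €</span></p>'
)


def pick(html, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = BeautifulSoup(html, "html.parser").select_one(selector)
    return node.get_text(strip=True) if node is not None else None


results = {
    "a_general": pick(html_a, GENERAL),
    "a_strict": pick(html_a, STRICT),
    "b_general": pick(html_b, GENERAL),
    "b_strict": pick(html_b, STRICT),
}
print(results)
```

The general selector still finds the amount in the variant, while the strict one returns nothing there because the p is no longer the second sibling after the h3.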
And the correct way to utilize lambda here would be something like:
t3 = soup.find(
    lambda tag: tag.name == "h3" and "Investment" in tag.text
).find_next_sibling("p").find("span").get_text(strip=True)
but I think the selector method is better because this chains too many calls: if any of the finds returns None, the next call raises an error.
Actually, even with the selector, it's safest to use
t3 = soup.select_one(t3selector)
if t3 is not None: t3 = t3.get_text(strip=True)
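If you prefer the find-based route, the same None checks can be wrapped in a small helper. This is only a sketch: the function name `amount_after_heading` is made up, and it uses "html.parser" so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

HTML = """
<h3 style="clear: both;">
<abbr title="European Union">EU</abbr>Investment</h3>
<div class="conditions">
<p>bla bla bla
</p>
</div>
<p style="margin-bottom: 0;">
<span class="amount">66000 €</span>
</p>"""


def amount_after_heading(html, heading_text):
    """Return the text of the first span.amount after a matching h3, or None."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find(
        lambda tag: tag.name == "h3" and heading_text in tag.get_text()
    )
    if heading is None:
        return None  # no h3 contains the requested text
    span = heading.find_next("span", class_="amount")
    if span is None:
        return None  # heading found, but no amount follows it
    return span.get_text(strip=True)


found = amount_after_heading(HTML, "Investment")
missing = amount_after_heading(HTML, "Loan")
print(found)
print(missing)
```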

Related

BeautifulSoup - extracting text from multiple span elements w/o classes

So that's how HTML looks:
<p class="details">
<span>detail1</span>
<span class="number">1</span>
<span>detail2</span>
<span>detail3</span>
</p>
I need to extract detail2 & detail3.
But with this piece of code I only get detail1.
info = data.find("p", class_ = "details").span.text
How do I extract the needed items?
Thanks in advance!
Select your elements more specifically. In your case, select all <span> siblings that follow the <span> with class number:
soup.select('span.number ~ span')
Example
from bs4 import BeautifulSoup
html='''<p class="details">
<span>detail1</span>
<span class="number">1</span>
<span>detail2</span>
<span>detail3</span>
</p>'''
soup = BeautifulSoup(html, "html.parser")
[t.text for t in soup.select('span.number ~ span')]
Output
['detail2', 'detail3']
You can find all <span>s and do normal indexing:
from bs4 import BeautifulSoup
html_doc = """\
<p class="details">
<span>detail1</span>
<span class="number">1</span>
<span>detail2</span>
<span>detail3</span>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")
spans = soup.find("p", class_="details").find_all("span")
for s in spans[-2:]:
    print(s.text)
Prints:
detail2
detail3
Or CSS selectors:
spans = soup.select(".details span:nth-last-of-type(-n+2)")
for s in spans:
    print(s.text)
Prints:
detail2
detail3
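The sibling idea also works without CSS selectors. Here is a minimal sketch using find() and find_next_siblings(), assuming the span with class number always marks where the wanted items start:

```python
from bs4 import BeautifulSoup

html_doc = """\
<p class="details">
<span>detail1</span>
<span class="number">1</span>
<span>detail2</span>
<span>detail3</span>
</p>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Locate the marker span, then collect every <span> sibling that follows it.
marker = soup.find("span", class_="number")
details = []
if marker is not None:
    details = [s.get_text(strip=True) for s in marker.find_next_siblings("span")]
print(details)
```

Unlike indexing with spans[-2:], this does not depend on exactly two items following the marker.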

How can I get all text inside paragraph tags that's inside a div element

So I'm trying to scrape a news website and get the actual text inside it. My problem right now is that the actual article is divided into several p tags, which in turn are inside a div tag.
It looks like this:
<div>
<p><strong>S</strong>paragraph</p>
<p>paragraph</p>
<h2 class="more-on__heading">heading</h2>
<figure>fig</figure>
<h2>header</h2>
<p>text</p>
<p>text</p>
<p>text</p>
</div>
What I tried so far is this:
article = requests.get(url)
soup = BeautifulSoup(article.content, 'html.parser')
article_title = soup.find('h1').text
article_author = soup.find('a', class_='author-link').text
article_text = ''
for element in soup.find('div', class_='wysiwyg wysiwyg--all-content css-1vkfgk0'):
    article_text += element.find('p').text
But it shows that 'NoneType' object has no attribute 'text'
Because the expected output is not clear from the question, a general approach would be to select all <p> elements in your <div>, e.g. with CSS selectors, extract their text and join() it with whatever separator you like:
article_text = '\n'.join(e.text for e in soup.select('div p'))
If you only want to scrape text from the <p> siblings that follow the h2 in your example, use:
article_text = '\n'.join(e.text for e in soup.select('h2 ~ p'))
or with find() and find_next_siblings():
article_text = '\n'.join(e.text for e in soup.find('h2').find_next_siblings('p'))
Example
from bs4 import BeautifulSoup
html='''
<div>
<p><strong>S</strong>paragraph</p>
<p>paragraph</p>
<h2 class="more-on__heading">heading</h2>
<figure>fig</figure>
<h2>header</h2>
<p>text</p>
<p>text</p>
<p>text</p>
</div>'''
soup = BeautifulSoup(html, "html.parser")
article_text = '\n'.join(e.text for e in soup.select('h2 ~ p'))
print(article_text)
Output
text
text
text
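As for the error in the question: assuming the div's children are the <p> tags themselves, iterating over the div yields each <p>, and calling find('p') on a <p> searches inside it, finds no nested <p>, and returns None. The sketch below reproduces that and shows one way to collect only the direct <p> children; the class name "article" is a stand-in for the real one.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="article">'
    '<p><strong>S</strong>paragraph</p>'
    '<p>paragraph</p>'
    '<h2>heading</h2>'
    '<p>text</p>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="article")

# The first child is already a <p>; searching inside it for another <p> fails.
first_child = container.contents[0]
nested = first_child.find("p")

# recursive=False limits the search to direct children of the div.
paragraphs = container.find_all("p", recursive=False)
article_text = "\n".join(p.get_text() for p in paragraphs)
print(article_text)
```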

Extract content of div tag except other tags using BeautifulSoup

I have below HTML content, wherein div tag looks like below
<div class="block">aaa
<p> bbb</p>
<p> ccc</p>
</div>
From above I want to extract text only as "aaa" and not other tags content.
When I do,
soup.find('div', {"class": "block"})
it gives me all the content as text and I want to avoid the contents of p tag.
Is there a method available in BeautifulSoup to do this?
Check the type of each child element. You could try:
from bs4 import BeautifulSoup
from bs4 import element
s = '''
<div class="block">aaa
<p> bbb</p>
<p> ccc</p>
<h1>ddd</h1>
</div>
'''
soup = BeautifulSoup(s, "lxml")
for e in soup.find('div', {"class": "block"}):
    if type(e) == element.NavigableString and e.strip():
        print(e.strip())
# aaa
And this will ignore all text in sub tags.
You can remove the p tags from that div, which effectively gives you the aaa text.
Here's how:
from bs4 import BeautifulSoup
sample = """<div class="block">aaa
<p> bbb</p>
<p> ccc</p>
</div>
"""
s = BeautifulSoup(sample, "html.parser")
excluded = [i.extract() for i in s.find("div", class_="block").find_all("p")]
print(s.text.strip())
Output:
aaa
You can use find_next(), which returns the first match found:
from bs4 import BeautifulSoup
html = '''
<div class="block">aaa
<p> bbb</p>
<p> ccc</p>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
print(soup.find('div', {"class": "block"}).find_next(text=True))
Output:
aaa
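Another option is to ask only for the strings that are direct children of the div. This is a minimal sketch; the helper name `own_text` is made up, and it assumes a Beautiful Soup version that accepts the `string` argument (older releases spell it `text`).

```python
from bs4 import BeautifulSoup

html = '''
<div class="block">aaa
<p> bbb</p>
<p> ccc</p>
</div>
'''


def own_text(tag):
    """Join only the strings that are direct children of `tag`."""
    return "".join(tag.find_all(string=True, recursive=False)).strip()


soup = BeautifulSoup(html, "html.parser")
block = soup.find("div", class_="block")
direct = own_text(block)
print(direct)

# Nothing was removed from the tree, so the <p> tags are still there.
remaining = len(block.find_all("p"))
print(remaining)
```

Compared with extract(), this leaves the parsed document unmodified, which matters if you still need the p tags afterwards.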

Isolating title and text Beautiful Soup

I have some code that parses out a div from a page and then finds all "p" tags, each of which has a title and some text.
sample:
for fn in os.listdir('.'):
    if os.path.isfile(fn):
        url = "%s/%s" % (path, fn)
        page = open(url)
        soup = BeautifulSoup(page, 'html.parser')
        soup2 = soup.find("div", {"class": "aui-field-wrapper-content"})
        print(soup2.p.prettify())
        for node in soup2.findAll('p'):
            print(''.join(node.findAll(text=True)))
which returns
sample:
<p>
<b>
<strong class="TooltipInline" data-toggle="tooltip" title="Molecular formula">
Mol. formula:
</strong>
</b>
C23H30O6
</p>
In this instance I want to access the title "Mol. formula:" and the text "C23H30O6" individually. Currently I am able to return
Mol. formula: C23H30O6 but not the individual components. I am relatively new to Beautiful Soup and am unsure how to reference each component of a "p" tag.
The other way to approach the problem is to get the b element inside the p element and consider it your "label", then go sideways and get the next sibling element:
label = p.b
value = label.next_sibling.strip()
print(label.get_text(strip=True), value)
Demo:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <p>
... <b>
... <strong class="TooltipInline" data-toggle="tooltip" title="Molecular formula">
... Mol. formula:
... </strong>
... </b>
... C23H30O6
... </p>
... """
>>>
>>> soup = BeautifulSoup(data, "html.parser")
>>>
>>> p = soup.p
>>>
>>> label = p.b
>>> value = label.next_sibling.strip()
>>> print(label.get_text(strip=True), value)
Mol. formula: C23H30O6
Your method of findAll(text=True) is doing the same thing as the get_text() method from Beautiful Soup. It will get all the text in the <p> tag. If you have a stable format a simple way to do it would be:
ptext = node.get_text().split(':',1)
title = ptext[0].strip()
value = ptext[1].strip()
Regarding the child-tag question, note that the molecular formula itself is not inside any tag other than the <p> tag.
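When the div holds several such "p" tags, the label/value split can be applied to each one and collected into a dict. This is only a sketch: the second row ("Sample label") is invented sample data, and it assumes every row keeps its label inside a b tag with the value as plain text directly in the p.

```python
from bs4 import BeautifulSoup

data = """
<div class="aui-field-wrapper-content">
<p><b><strong title="Molecular formula">Mol. formula:</strong></b> C23H30O6</p>
<p><b><strong title="Sample">Sample label:</strong></b> sample value</p>
</div>
"""

soup = BeautifulSoup(data, "html.parser")
wrapper = soup.find("div", class_="aui-field-wrapper-content")

fields = {}
for p in wrapper.find_all("p"):
    label = p.b
    if label is None:
        continue  # skip paragraphs that have no label element
    key = label.get_text(strip=True).rstrip(":")
    # Strings that are direct children of <p>, i.e. everything outside <b>.
    value = "".join(p.find_all(string=True, recursive=False)).strip()
    fields[key] = value

print(fields)
```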

Unable to fetch <div> tag values in python

The required value is present within the div tag:
<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>
I am using the below code to fetch the value "Rs. 350":
soup.select('div.search-page-text')
But in the output I get "None". Could you please help me resolve this issue?
An element with both a sub-element and string content can be accessed using stripped_strings:
from bs4 import BeautifulSoup
h = """<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>"""
soup = BeautifulSoup(h, "html.parser")
for s in soup.select("div.search-page-text")[0].stripped_strings:
    print(s)
Output:
Cost for 2:
Rs. 350
The problem is that this includes both the string content of the span and that of the div. But if you know that the div first contains the span with text, you could get the interesting string as
list(soup.select("div.search-page-text")[0].stripped_strings)[1]
If you know you only ever want the string that is the immediate text of the <div> tag and not the <span> child element, you could do this.
from bs4 import BeautifulSoup
txt = '''<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>'''
soup = BeautifulSoup(txt)
for div in soup.find_all("div", { "class" : "search-page-text" }):
    print ''.join(div.find_all(text=True, recursive=False)).strip()
    #print div.find_all(text=True, recursive=False)[1].strip()
One of the lines returned by div.find_all is just a newline. That could be handled in a variety of ways. I chose to join and strip it rather than rely on the text being at a certain index (see commented line) in the resultant list.
Python 3
For python 3 the print line should be
print (''.join(div.find_all(text=True, recursive=False)).strip())
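A further option is to start from the label span and read the text node that follows it. This is a minimal sketch assuming the value always comes right after the span as plain text; the variable names are made up.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

txt = '''<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>'''

soup = BeautifulSoup(txt, "html.parser")

# The span is a direct child of the div; the value is its next sibling.
label = soup.select_one("div.search-page-text > span")
cost = None
if label is not None and isinstance(label.next_sibling, NavigableString):
    cost = label.next_sibling.strip()
print(cost)
```

The isinstance check guards against layouts where another tag, rather than text, follows the span.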
