Isolating title and text with Beautiful Soup - Python

I have some code that parses a div out of a page and then finds all "p" tags, each of which contains a title and some text.
sample:
for fn in os.listdir('.'):
    if os.path.isfile(fn):
        url = "%s/%s" % (path, fn)
        page = open(url)
        soup = BeautifulSoup(page, 'html.parser')
        soup2 = soup.find("div", {"class": "aui-field-wrapper-content"})
        print soup2.p.prettify()
        for node in soup2.findAll('p'):
            print ''.join(node.findAll(text=True))
which returns
sample:
<p>
 <b>
  <strong class="TooltipInline" data-toggle="tooltip" title="Molecular formula">
   Mol. formula:
  </strong>
 </b>
 C23H30O6
</p>
In this instance I want to individually access the title "Mol. formula:" and the text "C23H30O6". Currently I am able to return "Mol. formula: C23H30O6", but not the individual components. I am relatively new to Beautiful Soup and am unsure how to reference each component of a "p" tag.

Another way to approach the problem is to get the b element inside the p element and treat it as your "label", then go sideways and get the next sibling element:
label = p.b
value = label.next_sibling.strip()
print(label.get_text(strip=True), value)
Demo:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <p>
... <b>
... <strong class="TooltipInline" data-toggle="tooltip" title="Molecular formula">
... Mol. formula:
... </strong>
... </b>
... C23H30O6
... </p>
... """
>>>
>>> soup = BeautifulSoup(data, "html.parser")
>>>
>>> p = soup.p
>>>
>>> label = p.b
>>> value = label.next_sibling.strip()
>>> print(label.get_text(strip=True), value)
Mol. formula: C23H30O6

Your method of findAll(text=True) is doing the same thing as the get_text() method from Beautiful Soup. It will get all the text in the <p> tag. If you have a stable format, a simple way to do it would be:
ptext = node.get_text().split(':',1)
title = ptext[0].strip()
value = ptext[1].strip()
In reference to the child tag question, note that the molecular formula isn't inside any tag except the <p> tag itself.
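As a quick check, here is a minimal sketch of this split approach applied to the sample <p> from the question (the HTML is condensed onto one line):
from bs4 import BeautifulSoup

sample = ('<p><b><strong class="TooltipInline" data-toggle="tooltip" '
          'title="Molecular formula">Mol. formula:</strong></b> C23H30O6</p>')
node = BeautifulSoup(sample, "html.parser").p

# split once on the first colon, then strip the surrounding whitespace
title, value = [part.strip() for part in node.get_text().split(":", 1)]
print(title)  # Mol. formula
print(value)  # C23H30O6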

Related

python: beautiful soup extracting info

I am using beautiful soup to parse HTML as follows:
html_content2 ="""
<h3 style="cear: both;">
<abbr title="European Union">EU</abbr>Investment</h3>
<div class="conditions">
<p>bla bla bla
</p>
</div>
<p style="margin-bottom: 0;">
<span class="amount">66000 €</span>
</p>"""
I would like to extract the amount of money and the code I have is:
from bs4 import BeautifulSoup
html_content = html_content2
soup = BeautifulSoup(html_content, "lxml")
t3 = soup.find(lambda tag:tag.name=="h3" and ": Investment").find_next_sibling().find_next_sibling("p").find("span").contents
print(t3)
The intention here is the following:
get h3 tag WITH text Investment and from there get next sibling and another next sibling with tag p then span and get the contents
In the previous code I don't know how to include the word "Investment" in the lambda function.
I tried:
tag.name=="h3" and tag.contents==": Investment"
this does not work.
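For reference, tag.contents is a list of child nodes, not a plain string, so that equality can never be True. A minimal sketch showing what .contents and .text actually hold:
from bs4 import BeautifulSoup

h3 = BeautifulSoup('<h3><abbr title="European Union">EU</abbr>Investment</h3>', "lxml").h3
print(h3.contents)  # [<abbr title="European Union">EU</abbr>, 'Investment']
print(h3.text)      # EUInvestment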
Use lambda tag: tag.name == "h3" and "Investment" in tag.text:
from bs4 import BeautifulSoup
html_content = """
<h3 style="cear: both;">
<abbr title="European Union">EU</abbr>Investment</h3>
<div class="conditions">
<p>bla bla bla
</p>
</div>
<p style="margin-bottom: 0;">
<span class="amount">66000 €</span>
</p>"""
soup = BeautifulSoup(html_content, "lxml")
amount = (
    soup.find(lambda tag: tag.name == "h3" and "Investment" in tag.text)
    .find_next("span", class_="amount")
    .text
)
print(amount)
Prints:
66000 €
You can use CSS selectors too:
amount = soup.select_one("h3:-soup-contains('Investment') ~ * .amount").text
print(amount)
Prints:
66000 €
You can get it with just one statement (select) using CSS selectors:
# t3selector = ....
t3 = soup.select_one(t3selector).get_text(strip=True)
where
t3selector = 'h3:-soup-contains("Investment") ~ * .amount'
if you just want the text from an element with a class amount inside a sibling of a h3 containing "Investment", but if you want exactly
get h3 tag WITH text Investment and from there get next sibling and another next sibling with tag p then span and get the contents
you'll need a slightly more specific selector
t3selector = 'h3:-soup-contains("Investment") + * + p span'
( if the span must be a child [direct descendant] of p, then end with p > span )
And the correct way to utilize lambda here would be something like:
t3 = soup.find(
    lambda tag: tag.name == "h3" and "Investment" in tag.text
).find_next_sibling("p").find("span").get_text(strip=True)
but I think the selector method is better because this has too many calls - if any of the finds return None, an error will be raised.
Actually, even with the selector, it's safest to use
t3 = soup.select_one(t3selector)
if t3 is not None: t3 = t3.get_text(strip=True)
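Putting that together, a defensive sketch reusing the html_content from the question and the stricter selector above (assumes a recent bs4/soupsieve so :-soup-contains is available):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "lxml")  # html_content as defined in the question

t3selector = 'h3:-soup-contains("Investment") + * + p span'
t3 = soup.select_one(t3selector)
if t3 is not None:
    t3 = t3.get_text(strip=True)
print(t3)  # 66000 €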

How can I get all text inside paragraph tags that's inside a div element

So I'm trying to scrape a news website and get the actual text inside it. My problem right now is that the actual article is divided into several p tags, which in turn are inside a div tag.
It looks like this:
<div>
<p><strong>S</strong>paragraph</p>
<p>paragraph</p>
<h2 class="more-on__heading">heading</h2>
<figure>fig</figure>
<h2>header</h2>
<p>text</p>
<p>text</p>
<p>text</p>
</div>
What I tried so far is this:
article = requests.get(url)
soup = BeautifulSoup(article.content, 'html.parser')
article_title = soup.find('h1').text
article_author = soup.find('a', class_='author-link').text
article_text = ''
for element in soup.find('div', class_='wysiwyg wysiwyg--all-content css-1vkfgk0'):
    article_text += element.find('p').text
But it shows that 'NoneType' object has no attribute 'text'
Because the expected output is not entirely clear from the question - the general approach would be to select all p in your div, e.g. with CSS selectors, extract the text, and join() it by whatever you like:
article_text = '\n'.join(e.text for e in soup.select('div p'))
If you just want to scrape the text from the siblings of the h2 in your example, use:
article_text = '\n'.join(e.text for e in soup.select('h2 ~ p'))
or with find() and find_next_siblings():
article_text = '\n'.join(e.text for e in soup.find('h2').find_next_siblings('p'))
Example
from bs4 import BeautifulSoup
html='''
<div>
<p><strong>S</strong>paragraph</p>
<p>paragraph</p>
<h2 class="more-on__heading">heading</h2>
<figure>fig</figure>
<h2>header</h2>
<p>text</p>
<p>text</p>
<p>text</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
article_text = '\n'.join(e.text for e in soup.select('h2 ~ p'))
print(article_text)
Output
text
text
text
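As for the original error: iterating the div itself walks every child (text nodes, the <figure>, the <h2>s, and the <p>s), and .find('p') looks for a <p> nested inside that child, which returns None for most of them (a <p> has no <p> inside it), hence 'NoneType' object has no attribute 'text'. A sketch of the loop rewritten to iterate only the p tags, reusing the class name from your snippet:
article_text = ''
for p in soup.find('div', class_='wysiwyg wysiwyg--all-content css-1vkfgk0').find_all('p'):
    article_text += p.text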

Extract content with BeautifulSoup and Python

I'm trying to scrape a forum but I can't deal with the comments, because the users use emoticons and bold fonts, and cite previous messages, and so on...
For example, here's one of the comments that I have a problem with:
<div class="content">
    <blockquote>
        <div>
            <cite>User write:</cite>
            I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">
        </div>
    </blockquote>
    <br/>
    THIS IS THE COMMENT THAT I NEED!
</div>
I searching for help for the last 4 days and I couldn't find anything, so I decided to ask here.
This is the code that I'm using:
def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html, "lxml")

def get_messages(url):
    soup = make_soup(url)
    msg = soup.find("div", {"class" : "content"})
    # I get in msg the whole message, exactly as I wrote previously
    print msg
    # Here I get:
    # 1. <blockquote> ... </blockquote>
    # 2. <br/>
    # 3. THIS IS THE COMMENT THAT I NEED!
    for item in msg.children:
        print item
I'm looking for a way to deal with messages in a general way, no matter what they contain. Sometimes they put emoticons in the middle of the text and I need to remove them and get the whole message (in this situation, BeautifulSoup will put each part of the message (first part, emoticon, second part) in different items).
Thanks in advance!
Use decompose: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose
decompose() removes the tags (and their contents) that you don't want. In your case:
soup.blockquote.decompose()
or all unwanted tags:
for name in ['blockquote', 'img', ... ]:
    for tag in soup.find_all(name):
        tag.decompose()
Your example:
>>> from bs4 import BeautifulSoup
>>> html = """<div class="content">
... <blockquote>
... <div>
... <cite>User write:</cite>
... I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">
... </div>
... </blockquote>
... <br/>
... THIS IS THE COMMENT THAT I NEED!
... </div>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find('blockquote').decompose()
>>> soup.find("div", {"class" : "content"}).text.strip()
u'THIS IS THE COMMENT THAT I NEED!'
Update
Sometimes all you have is a tag starting point but you are actually interested in the content before or after that starting point. You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:
>>> html = """<div>No<blockquote>No</blockquote>Yes.<em>Yes!</em>Yes?</div>No!"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> elm = soup.blockquote.next_sibling
>>> txt = ""
>>> while elm:
...     txt += elm.string
...     elm = elm.next_sibling
...
>>> print(txt)
Yes.Yes!Yes?
BeautifulSoup has a get_text method. Maybe this is what you want.
From their documentation:
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup)
soup.get_text()
u'\nI linked to example.com\n'
soup.i.get_text()
u'example.com'
If the text you want is never within any additional tags, as in your example, you can use extract() to get rid of all the tags and their contents:
html = '<div class="content">\
<blockquote>\
<div>\
<cite>User write:</cite>\
I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">\
</div>\
</blockquote>\
<br/>\
THIS IS THE COMMENT THAT I NEED!\
</div>'
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', class_='content')
tags = div.findAll(recursive=False)
for tag in tags:
    tag.extract()
text = div.get_text(strip=True)
print(text)
This gives:
THIS IS THE COMMENT THAT I NEED!
To deal with emoticons, you'll have to do something more complicated. You'll probably have to define a list of emoticons to recognize yourself, and then parse the text to look for them.
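For example, if the emoticons always show up as <img class="smilies"> tags like the one in your sample, a minimal sketch (building on decompose() above) is to strip them before extracting the text:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # 'html' holds the comment markup shown above
msg = soup.find("div", {"class": "content"})

# drop the quoted message, if there is one
if msg.blockquote is not None:
    msg.blockquote.decompose()

# drop any emoticon images sitting in the middle of the remaining text;
# use msg.find_all("img", class_="smilies") if only the smiley images should go
for img in msg.find_all("img"):
    img.decompose()

print(msg.get_text(strip=True))  # THIS IS THE COMMENT THAT I NEED!
If you would rather keep a textual placeholder for each emoticon, use img.replace_with(img.get("alt", "")) instead of decompose(); that keeps the alt text (e.g. ":116:") in place.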

Python splitting the HTML

So I have some HTML markup and I'd like to access a tag with a specific class inside a tag with a specific id. For example:
<tr id="one">
<span class="x">X</span>
.
.
.
.
</tr>
How do I get the content of the tag with the class "x" inside the tag with an id of "one"?
I'm not used to working with lxml/XPath, so I always tend to use BeautifulSoup. Here is a solution with BeautifulSoup:
>>> HTML = """<tr id="one">
... <span class="x">X</span>
... <span class="ax">X</span>
... <span class="xa">X</span>
... </tr>"""
>>>
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(HTML)
>>> tr = soup.find('tr', {'id':'one'})
>>> span = tr.find('span', {'class':'x'})
>>> span
<span class="x">X</span>
>>> span.text
u'X'
You need something called "xpath".
from lxml import html
tree = html.fromstring(my_string)
x = tree.xpath('//*[@id="one"]/span[@class="x"]/text()')
print x[0] # X
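As an aside, the same lookup also works in bs4 with a single CSS selector (a sketch, assuming the HTML snippet from the answer above):
from bs4 import BeautifulSoup

# html.parser keeps a bare <tr>; stricter parsers may drop a <tr> outside a <table>
soup = BeautifulSoup(HTML, "html.parser")
print(soup.select_one('tr#one span.x').text)  # X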

Python BeautifulSoup: extract data between <a> tags (and associate it with its own URL) [duplicate]

I have HTML code like this:
<h2 class="title"><a href="http://www.gurletins.com">My HomePage</a></h2>
<h2 class="title"><a href="http://www.gurletins.com/sections">Sections</a></h2>
I need to extract the texts (link descriptions) between 'a' tags. I need an array to store these like:
a[0] = "My HomePage"
a[1] = "Sections"
I need to do this in python using BeautifulSoup.
Please help me, thank you!
You can do something like this:
import BeautifulSoup
html = """
<html><head></head>
<body>
<h2 class='title'><a href='http://www.gurletins.com'>My HomePage</a></h2>
<h2 class='title'><a href='http://www.gurletins.com/sections'>Sections</a></h2>
</body>
</html>
"""
soup = BeautifulSoup.BeautifulSoup(html)
print [elm.a.text for elm in soup.findAll('h2', {'class': 'title'})]
# Output: [u'My HomePage', u'Sections']
Alternatively, you can pull the link text out of every a tag directly:
print [a.findAll(text=True) for a in soup.findAll('a')]
The following code extracts the text (link descriptions) between 'a' tags and stores it in an array.
>>> from bs4 import BeautifulSoup
>>> data = """<h2 class="title"><a href="http://www.gurletins.com">My HomePage</a></h2>
...
... <h2 class="title"><a href="http://www.gurletins.com/sections">Sections</a></h2>"""
>>> soup = BeautifulSoup(data, "html.parser")
>>> reqTxt = soup.find_all("h2", {"class":"title"})
>>> a = []
>>> for i in reqTxt:
...     a.append(i.get_text())
...
>>> a
['My HomePage', 'Sections']
>>> a[0]
'My HomePage'
>>> a[1]
'Sections'
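If you also need the URL that goes with each description (as the title of the question asks), a small follow-up sketch on the same soup:
>>> [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]
[('My HomePage', 'http://www.gurletins.com'), ('Sections', 'http://www.gurletins.com/sections')]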
