Python BeautifulSoup extract text between element - python

I am trying to extract "THIS IS MY TEXT" from the following HTML:
<html>
<body>
<table>
<td class="MYCLASS">
<!-- a comment -->
<a hef="xy">Text</a>
<p>something</p>
THIS IS MY TEXT
<p>something else</p>
</br>
</td>
</table>
</body>
</html>
I tried it this way:
soup = BeautifulSoup(html)
for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
    print hit.text
But I get all the text from all the nested tags, plus the comment.
Can anyone help me get just "THIS IS MY TEXT" out of this?

Learn more about how to navigate through the parse tree in BeautifulSoup. The parse tree is made of Tags and NavigableStrings (such as THIS IS MY TEXT). An example:
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
'<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
'<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
'</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
# <html>
# <head>
# <title>
# Page title
# </title>
# </head>
# <body>
# <p id="firstpara" align="center">
# This is paragraph
# <b>
# one
# </b>
# .
# </p>
# <p id="secondpara" align="blah">
# This is paragraph
# <b>
# two
# </b>
# .
# </p>
# </body>
# </html>
To move down the parse tree you have contents and string.
contents is an ordered list of the Tag and NavigableString objects contained within a page element.
If a tag has only one child node, and that child node is a string, the child node is made available as tag.string, as well as tag.contents[0].
For the example above, that means you can get:
soup.b.string
# u'one'
soup.b.contents[0]
# u'one'
When a tag has several child nodes you get, for instance:
pTag = soup.p
pTag.contents
# [u'This is paragraph ', <b>one</b>, u'.']
So here you can work with contents and pick out the child at the index you want.
You can also iterate over a Tag directly, which is a shortcut for iterating over its contents. For instance:
for i in soup.body:
    print i
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
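For readers on Python 3 with the bs4 package, here is a minimal sketch of the same navigation. It assumes the built-in "html.parser" backend and closes the <p> tags explicitly; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Same sample document as above, with the <p> tags closed explicitly.
doc = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(doc, "html.parser")

# A tag with a single string child exposes it as .string and .contents[0].
first_b_string = soup.b.string
first_b_child = soup.b.contents[0]

# A tag with several children lists them all, in order, in .contents.
first_p_contents = soup.p.contents

# Iterating over a tag walks its direct children.
body_children = [child.name for child in soup.body]

print(first_b_string, first_p_contents, body_children)
```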

Use .children instead:
from bs4 import NavigableString, Comment
print ''.join(unicode(child) for child in hit.children
              if isinstance(child, NavigableString) and not isinstance(child, Comment))
Yes, this is a bit of a dance.
Output:
>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
...     print ''.join(unicode(child) for child in hit.children
...         if isinstance(child, NavigableString) and not isinstance(child, Comment))
...
THIS IS MY TEXT
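A rough Python 3 equivalent of the .children approach, as a sketch: direct_text is a hypothetical helper name, the markup is a trimmed copy of the question's cell, and "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Trimmed copy of the cell from the question (assumed well-formed here).
HTML = """<table><tr><td class="MYCLASS">
<!-- a comment -->
<a href="xy">Text</a>
<p>something</p>
THIS IS MY TEXT
<p>something else</p>
</td></tr></table>"""


def direct_text(tag):
    """Join the strings that are direct children of tag, skipping comments."""
    parts = [
        str(child)
        for child in tag.children
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    ]
    return "".join(parts).strip()


soup = BeautifulSoup(HTML, "html.parser")
cell = soup.find("td", class_="MYCLASS")
result = direct_text(cell)
print(result)
```

Comment is a subclass of NavigableString, which is why the second isinstance check is needed.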

You can use .contents:
>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
...     print hit.contents[6].strip()
...
THIS IS MY TEXT

With your own soup object:
soup.p.next_sibling.strip()
This grabs the <p> directly with soup.p (which hinges on it being the first <p> in the parse tree; otherwise, find the element using your choice of filters).
It then uses next_sibling on the Tag object that soup.p returns, since the desired text sits at the same level of the parse tree as the <p>.
.strip() is just a Python str method that removes leading and trailing whitespace.
In the interpreter this looks something like:
In [4]: soup.p
Out[4]: <p>something</p>
In [5]: type(soup.p)
Out[5]: bs4.element.Tag
In [6]: soup.p.next_sibling
Out[6]: u'\n THIS IS MY TEXT\n '
In [7]: type(soup.p.next_sibling)
Out[7]: bs4.element.NavigableString
In [8]: soup.p.next_sibling.strip()
Out[8]: u'THIS IS MY TEXT'
In [9]: type(soup.p.next_sibling.strip())
Out[9]: unicode

Short answer: soup.findAll('p')[0].next.next
Real answer: You need an invariant reference point from which you can get to your target.
You mention in your comment to Haidro's answer that the text you want is not always in the same place. Find a sense in which it is in the same place relative to some element. Then figure out how to make BeautifulSoup navigate the parse tree following that invariant path.
For example, in the HTML you provide in the original post, the target string appears immediately after the first paragraph element, and that paragraph is not empty. Since findAll('p') finds all paragraph elements, soup.findAll('p')[0] is the first paragraph element.
You could in this case use soup.find('p'), but soup.findAll('p')[n] is more general, since your actual scenario may need the 5th paragraph or something like that.
The next attribute is the next parsed element in the tree, including children. So soup.findAll('p')[0].next is the text of the paragraph, and soup.findAll('p')[0].next.next is your target in the HTML provided.
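As a sketch of the "invariant reference point" idea in Python 3 with bs4: the anchor here is assumed to be the first <p> inside the MYCLASS cell, and the target is reached from it in two ways. In bs4 the attribute is spelled next_element (next is the older name), and "html.parser" is an assumption.

```python
from bs4 import BeautifulSoup

HTML = """<table><tr><td class="MYCLASS">
<!-- a comment -->
<a href="xy">Text</a>
<p>something</p>
THIS IS MY TEXT
<p>something else</p>
</td></tr></table>"""

soup = BeautifulSoup(HTML, "html.parser")

# Anchor: the first <p> inside the cell (assumed to be the stable landmark).
anchor = soup.find("td", class_="MYCLASS").find("p")

# Route 1: the target is the node right after the anchor at the same level.
via_sibling = anchor.next_sibling.strip()

# Route 2: step through parse order, first into the paragraph's own text,
# then on to the string that follows the paragraph.
via_next = anchor.next_element.next_element.strip()

print(via_sibling)
print(via_next)
```

Both routes depend on the paragraph being non-empty and directly followed by the text, so they are only as reliable as that assumption.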

soup = BeautifulSoup(html)
for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
    hit = hit.text.strip()
    print hit
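Try this. It prints the cell's text with the surrounding whitespace stripped. Note that .text also includes the text of the nested tags, so the output contains THIS IS MY TEXT together with the link and paragraph text.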

The BeautifulSoup documentation provides an example about removing objects from a document using the extract method. In the following example the aim is to remove all comments from the document:
Removing Elements
Once you have a reference to an element, you can rip it out of the
tree with the extract method. This code removes all the comments
from a document:
from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
<a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
# 1
# <a>2<b>3</b></a>
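The quoted documentation example targets the old BeautifulSoup 3 package. A minimal sketch of the same idea for bs4 on Python 3 follows, assuming "html.parser" and a version of the markup with its tags closed explicitly.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup(
    "1<!--The loneliest number--><a>2<!--Can be as bad as one--><b>3</b></a>",
    "html.parser",
)

# Collect every Comment node first, then detach each one from the tree.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()

result = str(soup)
print(result)
```

After the comments are gone, .text or get_text() on a tag no longer has any comment content to worry about, although it still includes the text of nested tags.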

Related

Remove elements from tree based on list of terms

I'm trying to capture some text from a webpage (whose URL is passed when running the script), but it's buried in a paragraph tag with no other attributes assigned. I can collect the contents of every paragraph tag, but I want to remove any elements from the tree that contain any of a list of keywords.
I get the following error:
tree.remove(elem) TypeError: Argument 'element' has incorrect type
(expected lxml.etree._Element, got _ElementStringResult)
I understand that what I am getting back when I try to iterate through the tree is the wrong type, but how do I get the element instead?
Sample Code:
#!/usr/bin/python
import sys
import requests
from lxml import html
from lxml import etree
url = sys.argv[1]
page = requests.get(url)
tree = html.fromstring(page.content)
terms = ['keyword1','keyword2','keyword3','keyword4','keyword5','keyword6','keyword7']
paragraphs = tree.xpath('//p/text()')
for elem in paragraphs:
    if any(term in elem for term in terms):
        tree.remove(elem)
In your code, elem is an _ElementStringResult which has the instance method getparent. Its parent is an Element object of one of the <p> nodes.
The parent has a remove method which can be used to remove it from the tree:
element.getparent().remove(element)
I do not believe there is a more direct way and I don't have a good answer to why there isn't a removeself method.
Using the example html:
content = '''
<root>
<p> nothing1 </p>
<p> keyword1 </p>
<p> nothing2 </p>
<p> nothing3 </p>
<p> keyword4 </p>
</root>
'''
You can see this in action in your code with:
from lxml import html
from lxml import etree
tree = html.fromstring(content)
terms = ['keyword1','keyword2','keyword3','keyword4','keyword5','keyword6','keyword7']
paragraphs = tree.xpath('//p/text()')
for elem in paragraphs:
    if any(term in elem for term in terms):
        actual_element = elem.getparent()
        actual_element.getparent().remove(actual_element)
for child in tree.getchildren():
    print('<{tag}>{text}</{tag}>'.format(tag=child.tag, text=child.text))
# Output:
# <p> nothing1 </p>
# <p> nothing2 </p>
# <p> nothing3 </p>
From the comments, it seems like this code isn't working for you. If so, you might need to provide more information about the structure of your html.
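One possible variation, sketched under the assumption that lxml is installed: select the <p> elements themselves with '//p' instead of their text nodes, so each match is already an element that can be removed through its parent. The keyword list is shortened for illustration.

```python
from lxml import html

content = (
    "<root>"
    "<p> nothing1 </p><p> keyword1 </p><p> nothing2 </p>"
    "<p> nothing3 </p><p> keyword4 </p>"
    "</root>"
)
terms = ["keyword1", "keyword4"]

tree = html.fromstring(content)

# xpath() returns a plain list, so removing elements inside the loop is safe.
for p in tree.xpath("//p"):
    text = p.text_content()
    if any(term in text for term in terms):
        p.getparent().remove(p)

remaining = [p.text.strip() for p in tree.xpath("//p")]
print(remaining)
```

Keep in mind that remove() also drops the element's tail text, meaning any text that directly follows the closing </p>. The sample above has none.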

BeautifulSoup replace_with for non-standard tags

I'm trying to write a parser that will take HTML and convert/output to Wiki syntax (<b> = ''', <i> = '', etc).
So far, BeautifulSoup seems only capable of replacing the contents within a tag, so <b> becomes <'''> instead of '''. I can use a re.sub() to swap these out, but since BS turns the document into a 'complex tree of Python objects', I can't figure out how to swap out these tags and re-insert them into the overall document.
Does anyone have ideas?
I am pretty sure there are already tools that would do that for you, but if you are asking about how to do that with BeautifulSoup, you can use replace_with(), but you would need to preserve the text of the element. Naive and simple example:
from bs4 import BeautifulSoup
data = """
<div>
<b>test1</b>
<i>test2</i>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for b in soup.find_all("b"):
    b.replace_with("'''%s'''" % b.text)
for i in soup.find_all("i"):
    i.replace_with("''%s''" % i.text)
print(soup.prettify())
Prints:
<div>
'''test1'''
''test2''
</div>
To also handle nested tags, e.g. "<div><b>bold with some <i>italics</i></b></div>" you have to be a bit more careful.
I put together the following implementation when I needed to do something similar:
from bs4 import BeautifulSoup
def wikify_tag(tag, replacement):
    tag.insert(0, replacement)
    tag.append(replacement)
    tag.unwrap()
data = """
<div>
<b>test1</b>
<i>test2</i>
<b>bold with some <i>italics</i></b>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for b in soup.find_all("b"):
    wikify_tag(b, "'''")
for i in soup.find_all("i"):
    wikify_tag(i, "''")
print(soup)
Prints (note that .prettify() makes it look uglier):
<div>
'''test1'''
''test2''
'''bold with some ''italics'''''
</div>
If you also want to replace tags with wiki-templates you can extend wikify_tag to take a start and an end string.
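A sketch of that extension might look like the following. The end parameter and the {{underline|...}} template are hypothetical and only illustrate distinct start and end strings.

```python
from bs4 import BeautifulSoup


def wikify_tag(tag, start, end=None):
    """Surround the tag's contents with start/end markers, then drop the tag."""
    if end is None:
        end = start  # symmetric markup such as ''' or ''
    tag.insert(0, start)
    tag.append(end)
    tag.unwrap()


data = "<div><b>bold with some <i>italics</i></b> <u>note</u></div>"
soup = BeautifulSoup(data, "html.parser")

for b in soup.find_all("b"):
    wikify_tag(b, "'''")
for i in soup.find_all("i"):
    wikify_tag(i, "''")
for u in soup.find_all("u"):
    # Hypothetical template name, used only to show asymmetric markers.
    wikify_tag(u, "{{underline|", "}}")

result = str(soup)
print(result)
```

Defaulting end to start keeps the existing two-argument calls working unchanged.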

How to add <br> tags with BeautifulSoup?

So let's say I have
<p>Hello World</p>
Can BeautifulSoup add a tag like so?
<br><p>Hello World</p>
Initially I could get around this by doing something like:
soup = BeautifulSoup("<p>Hello World</p>")
soup = BeautifulSoup(re.sub(r'(<p>)', r'<br>\1', soup.prettify()))
but the problem is that in actual usage with more complex html the .prettify() messes up the html by adding extra whitespace and lines.
I checked the docs, but they don't mention the <br> tag at all.
It can be done using the soup.insert() function:
>>> soup = BeautifulSoup("<p>Hello World</p>", "html.parser")
>>> br = soup.new_tag('br')
>>> br
<br/>
>>> soup.insert(0, br)
>>> soup
<br/><p>Hello World</p>
The insert() function inserts a tag at any numeric position. Here the position is 0, so the tag is inserted at the start.
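When the <p> is nested deeper in the document, an index into the top-level soup is less convenient. A minimal sketch using insert_before on the paragraph itself, assuming bs4 with "html.parser" and a made-up wrapper <div>:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Hello World</p></div>", "html.parser")

# Create the new element, then place it directly in front of the <p>.
br = soup.new_tag("br")
soup.p.insert_before(br)

result = str(soup)
print(result)
```

Printing str(soup) rather than soup.prettify() avoids the extra whitespace mentioned in the question.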

BeautifulSoup - removing MS Word specific tags?

I have an HTML document that was saved from MS Word, and it now contains some MS Word-specific tags. I don't need to keep any backwards compatibility with Word; I just need to extract the contents from that file. The problem is that the Word-specific tags are not so easy to remove.
I have this code:
from bs4 import BeautifulSoup, NavigableString
def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html)
    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""
            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)
            tag.replaceWith(s)
    return soup
It removes the unneeded tags, but some are left even after using this method.
For example look at this:
<P class="MsoNormal"><SPAN style="mso-bidi-font-weight: bold;">Some text -
some content<o:p></o:p></SPAN></P>
<P class="MsoNormal"><SPAN style="mso-bidi-font-weight: bold;">some text2 -
647894654<o:p></o:p></SPAN></P>
<P class="MsoNormal"><SPAN style="mso-bidi-font-weight: bold;">some text3 -
some content blabla<o:p></o:p></SPAN></P>
This is how it looks inside the HTML document. When I use the method like this:
invalid_tags = ['span']
stripped = strip_tags(html_file, invalid_tags)
print stripped
It prints this:
<p class="MsoNormal">Some text -
some content<html><body><o:p></o:p></body></html></p>
<p class="MsoNormal">some text2 -
647894654<html><body><o:p></o:p></body></html></p>
<p class="MsoNormal">some text3 -
some content blabla<html><body><o:p></o:p></body></html></p>
As you can see, for some reason html and body tags appeared there even though they do not exist in the original HTML. If I use invalid_tags = ['span', 'o:p'], it removes the <o:p></o:p> tags, but if I add html or body to the list, nothing happens and they are still kept there.
P.S. I can remove the html tags if I directly change where to look for tags, for example by adding soup = soup.body to the method (before findAll is used). But even after this, the body tags are still left hanging in those specific paragraphs.
You can try this:
from bs4 import BeautifulSoup
def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html)
    for t in invalid_tags:
        tag = soup.find_all(t)
        if tag:
            for item in tag:
                item.unwrap()
    return str(soup)
Then you just need to strip the html and body tags.
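A sketch of the same unwrap idea that sidesteps the extra html and body wrappers, assuming the built-in "html.parser" backend (which does not add them to a fragment). The sample paragraph is shortened from the question.

```python
from bs4 import BeautifulSoup


def strip_tags(html, invalid_tags):
    """Remove the listed tags but keep whatever they contain."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns a list, so unwrapping while looping is safe.
    for tag in soup.find_all(True):
        if tag.name in invalid_tags:
            tag.unwrap()
    return str(soup)


word_html = (
    '<P class="MsoNormal"><SPAN style="mso-bidi-font-weight: bold;">'
    "Some text - some content<o:p></o:p></SPAN></P>"
)
result = strip_tags(word_html, ["span", "o:p"])
print(result)
```

Because the fragment is parsed once rather than re-parsed for every nested tag, no new wrapper elements are introduced along the way.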

Search and Replace in HTML with BeautifulSoup

I want to use BeautifulSoup to search and replace </a> with </a><br>. I know how to open a page with urllib2 and then parse it to extract all the <a> tags. What I want to do is search and replace the closing tag with the closing tag plus the break. Any help is much appreciated.
EDIT
I would assume it would be something similar to:
soup.findAll('a').
In the documentation, there is a:
find(text="ahh").replaceWith('Hooray')
So I would assume it would be along the lines of:
soup.findAll(tag = '</a>').replaceWith(tag = '</a><br>')
But that doesn't work, and the Python help() doesn't give much.
This will insert a <br> tag after the end of each <a>...</a> element:
from BeautifulSoup import BeautifulSoup, Tag
# ....
soup = BeautifulSoup(data)
for a in soup.findAll('a'):
    a.parent.insert(a.parent.index(a)+1, Tag(soup, 'br'))
You can't use soup.findAll(tag = '</a>') because BeautifulSoup doesn't operate on the end tags separately - they are considered part of the same element.
If you wanted to put the <a> elements inside a <p> element as you ask in a comment, you can use this:
for a in soup.findAll('a'):
    p = Tag(soup, 'p') # create a P element
    a.replaceWith(p)   # put it where the A element is
    p.insert(0, a)     # put the A element inside the P (between <p> and </p>)
Again, you don't create the <p> and </p> separately because they are part of the same thing.
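With the newer bs4 package, the replace-then-insert steps can be expressed with wrap(). A minimal sketch, assuming "html.parser" and made-up sample links:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div><a href="#">one</a><a href="#">two</a></div>', "html.parser"
)

# wrap() puts each <a> inside a freshly created <p>, in place.
for a in soup.find_all("a"):
    a.wrap(soup.new_tag("p"))

result = str(soup)
print(result)
```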
Suppose you have an element that you know contains <br> tags. One way to replace each of them with a different string is:
originalSoup = BeautifulSoup(open("your_html_file.html"))
replaceString = ", " # replace each <br/> tag with ", "
# Ex. <p>Hello<br/>World</p> to <p>Hello, World</p>
cleanSoup = BeautifulSoup(str(originalSoup).replace("<br/>", replaceString))
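An alternative that stays inside the parse tree, rather than serialising and re-parsing, is to call replace_with on each <br>. A sketch assuming bs4 with "html.parser":

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello<br/>World</p>", "html.parser")

# Swap every <br> element for a plain string.
for br in soup.find_all("br"):
    br.replace_with(", ")

result = str(soup)
print(result)
```

This does not depend on how the serialiser spells the tag (<br>, <br/> or <br />), which the string replace does.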
You don't replace an end-tag; in BeautifulSoup you are dealing with a document object model like in a browser, not a string full of HTML. So you couldn't ‘replace’ an end-tag without also replacing the start-tag.
What you want to do is insert a new <br> element immediately after the <a>...</a> element. To do so you'll need to find the index of the <a> element inside its parent element, and insert the new element just after that index. E.g.:
soup = BeautifulSoup('<body>blah <a>blah</a> blah</body>')
for link in soup.findAll('a'):
    br = Tag(soup, 'br')
    index = link.parent.contents.index(link)
    link.parent.insert(index + 1, br)
# soup now serialises to '<body>blah <a>blah</a><br /> blah</body>'
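In bs4 the index bookkeeping can be left to insert_after. A minimal Python 3 sketch, assuming "html.parser" and a placeholder href:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<body>blah <a href="#">blah</a> blah</body>', "html.parser"
)

# A new <br> is created per link, since a tag can only live in one place.
for link in soup.find_all("a"):
    link.insert_after(soup.new_tag("br"))

result = str(soup)
print(result)
```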
