Editing tree in place while iterating in lxml - python

I am using lxml to parse html and edit it to produce a new document. Essentially, I'm trying to use it somewhat like the javascript DOM - I know this is not really the intended use, but much of it works well so far.
Currently, I use iterdescendants() to get an iterator over the elements and then deal with each in turn.
However, if an element is dropped during the iteration, its children are still visited, since dropping an element does not affect the iteration. In order to get the results I want, this hack works:
from lxml.html import fromstring, tostring
import urllib2
import re
html = '''
<html>
<head>
</head>
<body>
<div>
<p class="unwanted">This content should go</p>
<p class="fine">This content should stay</p>
</div>
<div id = "second" class="unwanted">
<p class = "alreadydead">This content should not be looked at</p>
<p class = "alreadydead">Nor should this</>
<div class="alreadydead">
<p class="alreadydead">Still dead</p>
</div>
</div>
<div>
<p class="yeswanted">This content should also stay</p>
</div>
</body>
</html>
'''
page = fromstring(html)
allElements = page.iterdescendants()
for element in allElements:
    s = "%s%s" % (element.get('class', ''), element.get('id', ''))
    if re.compile('unwanted').search(s):
        for i in range(len(element.findall('.//*'))):
            allElements.next()
        element.drop_tree()
print tostring(page.body)
This outputs:
<body>
<div>
<p class="yeswanted">This content should stay</p>
</div>
<div>
<p class="yeswanted">This content should also stay</p>
</div>
</body>
This feels like a nasty hack - is there a more sensible way to achieve this using the library?

To simplify things, you can use lxml's support for EXSLT regular expressions within XPath to find and remove the unwanted nodes without having to iterate over all descendants yourself.
This produces the same result as your script:
import lxml.html

EXSLT_NS = 'http://exslt.org/regular-expressions'
XPATH = r"//*[re:test(@class, '\bunwanted\b') or re:test(@id, '\bunwanted\b')]"

tree = lxml.html.fromstring(html)
for node in tree.xpath(XPATH, namespaces={'re': EXSLT_NS}):
    node.drop_tree()
print lxml.html.tostring(tree.body)
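If you don't need the word-boundary matching, a plain XPath contains() test is a possible alternative that avoids the EXSLT namespace. This is a sketch of my own, not part of the original answer:
import lxml.html

tree = lxml.html.fromstring(html)
# contains() does plain substring matching, so e.g. class="unwanted-ish" would also match
for node in tree.xpath("//*[contains(@class, 'unwanted') or contains(@id, 'unwanted')]"):
    node.drop_tree()
print lxml.html.tostring(tree.body)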

Related

BeautifulSoup (bs4) parsing wrong

Parsing this sample document with bs4, from python 2.7.6:
<html>
<body>
<p>HTML allows omitting P end-tags.
<p>Like that and this.
<p>And this, too.
<p>What happened?</p>
<p>And can we <p>nest a paragraph, too?</p></p>
</body>
</html>
Using:
from bs4 import BeautifulSoup as BS
...
tree = BS(fh)
HTML has, for ages, allowed omitted end-tags for various element types, including P (check the schema, or a parser). However, bs4's prettify() on this document shows that it doesn't end any of those paragraphs until it sees </body>:
<html>
<body>
<p>
HTML allows omitting P end-tags.
<p>
Like that and this.
<p>
And this, too.
<p>
What happened?
</p>
<p>
And can we
<p>
nest a paragraph, too?
</p>
</p>
</p>
</p>
</p>
</body>
It's not prettify()'s fault, because traversing the tree manually I get the same structure:
<[document]>
<html>
␊
<body>
␊
<p>
HTML allows omitting P end-tags.␊␊
<p>
Like that and this.␊␊
<p>
And this, too.␊␊
<p>
What happened?
</p>
␊
<p>
And can we
<p>
nest a paragraph, too?
</p>
</p>
␊
</p>
</p>
</p>
</body>
␊
</html>
␊
</[document]>
Now, this would be the right result for XML (at least up to </body>, at which point it should report a WF error). But this ain't XML. What gives?
The doc at http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser tells how to get BS4 to use different parsers. Apparently the default is html.parser, which the BS4 doc says is broken before Python 2.7.3, but which apparently still has the problem described above in 2.7.6.
Switching to "lxml" was unsuccessful for me, but switching to "html5lib" produces the correct result:
tree = BS(htmSource, "html5lib")
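For completeness, here is a minimal round trip with the html5lib backend (a sketch assuming html5lib is installed); prettify() then shows each paragraph being closed before the next one starts:
from bs4 import BeautifulSoup as BS

html_doc = """<html><body>
<p>HTML allows omitting P end-tags.
<p>Like that and this.
<p>And this, too.
</body></html>"""

tree = BS(html_doc, "html5lib")
print tree.body.prettify()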

Parse HTML string from a file and remove element using xpath and write it to same file in python

For my project, I have to remove selected content from an HTML file using Python and XPath.
The selected elements can be removed with the .remove() method; however, the content of the file on disk stays the same.
How do I write the modified content back to that file?
If I write the same tree object to the file using open().write(etree.tostring(tree_obj)), will the content differ for Unicode pages? Is there any other way to keep the modified file?
Why do the head tags in the output below have different content after printing the tree object?
Please suggest.
Below is my code sample.
Example: I need to remove all div tags inside the html page.
HTML file:
<html>
<head>test</head>
<body>
<p>welcome to the world</p>
<div id="one">
<p>one</p>
<div id="one1">one1</div>
<ul>
<li>ones</li>
<li>twos</li>
<li>threes</li>
</ul>
</div>
<div id="hell">
<p>heaven</p>
<div id="one1">one1</div>
<ul>
<li>ones</li>
<li>twos</li>
<li>threes</li>
</ul>
</div>
<input type="text" placeholder="enter something.." />
<input type="button" value="click" />
</body>
</html>
Python file:
# _*_ coding:utf-8 _*_
import os
import sys
import traceback
import datetime
from lxml import etree, html
import shutil
def process():
    fd = open("D:\\hello.html")
    tree = html.fromstring(fd.read())
    remove_tag = '//div'
    for element in tree.xpath(remove_tag):
        element.getparent().remove(element)
    print etree.tostring(tree)

process()
OUTPUT:
<html>
<head/><body><p>test
</p>
<p>welcome to the world</p>
<input type="text" placeholder="enter something.."/>
<input type="button" value="click"/>
</body></html>
I haven't worked with Python, but I have played with parsing HTML-based websites in Java with the help of the jsoup library.
Python has a similar library: Beautiful Soup. You can play with it to get the desired output.
Hope it helps.
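As a rough sketch of that suggestion (my own code, not the answerer's, and assuming the same D:\hello.html file as above), Beautiful Soup's decompose() removes an element together with everything inside it, and the result can then be written back out:
from bs4 import BeautifulSoup

with open("D:\\hello.html") as fd:
    soup = BeautifulSoup(fd.read())

# Remove every <div> element along with its contents
for div in soup.find_all("div"):
    div.decompose()

# Write the modified markup back, encoded as UTF-8 to stay safe with Unicode pages
with open("D:\\hello.html", "wb") as fd:
    fd.write(soup.encode("utf-8"))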
Have you tried using Python's standard library re?
import re

re.sub('<.*?>', '', '<nb>foobar<aon><mn>')
re.sub('</.*?>', '', '</nb>foobar<aon><mn>')
The above two operations can be used in combination to remove all of the HTML tags, and the patterns are easily modified to remove just the div tags, as sketched below.
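A sketch of that modification (my own adaptation, assuming the same D:\hello.html file): note that unlike the XPath approach it only strips the <div> and </div> tags themselves, leaving their contents in place, and it will misbehave on attribute values that contain '>':
import re

with open("D:\\hello.html") as fd:
    html_text = fd.read()

# Strip opening <div ...> tags and closing </div> tags, keeping their contents
html_text = re.sub(r'<div\b[^>]*>', '', html_text)
html_text = re.sub(r'</div>', '', html_text)
print html_text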

Parse paragraph and subsequent element with BeautifulSoup with one loop-cycle

I have a long list of blog comments which are coded as
<p> This is the text <br /> of the comment </p>
<div id="comment_details"> Here the details of the same comment </div>
I need to parse the comment and its details in the same loop cycle so that I can store them together in order.
Yet I am not sure how I should proceed, because I can easily parse them in two separate loops. Is it elegant and practical to do it in only one?
Please consider the following MWE
from bs4 import BeautifulSoup
html_doc = """
<!DOCTYPE html>
<html>
<body>
<div id="firstDiv">
<br></br>
<p>My first paragraph.<br />But this a second line</p>
<div id="secondDiv">
<b>Date1</b>
</div>
<br></br>
<p>My second paragraph.</p>
<div id="secondDiv">
<b>Date2</b>
</div>
<br></br>
<p>My third paragraph.</p>
<div id="secondDiv">
<b>Date3</b>
</div>
<br></br>
<p>My fourth paragraph.</p>
<div id="secondDiv">
<b>Date4</b>
</div>
<br></br>
<p>My fifth paragraph.</p>
<div id="secondDiv">
<b>Date5</b>
</div>
<br></br>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html_doc)

for p in soup.find(id="firstDiv").find_all("p"):
    print p.get_text()

for div in soup.find(id="firstDiv").find_all(id="secondDiv"):
    print div.b.get_text()
If you really wanted the subsequent sibling, that would be easy:
for p in soup.find(id="firstDiv").find_all("p"):
print p.get_text()
print p.next_sibling.b.get_text()
However, the next thing after the p is the string `'\n', not the div you want.
The problem is, there is no real structural relationship between the p and the div; it just so happens that each p always has a div with a certain id as a later sibling, and you want to exploit that. (If you're generating this HTML, obviously fix the structure… but I'll assume you aren't.) So, here are some options.
The best is probably:
for p in soup.find(id="firstDiv").find_all("p"):
print p.get_text()
print p.find_next_sibling(id='secondDiv').b.get_text()
If you only care about this particular document, and you know that it will always be true that the next sibling after the next sibling is the div you want:
    print p.get_text()
    print p.next_sibling.next_sibling.b.get_text()
Or you could rely on the fact that find_next_sibling() with no argument, unlike next_sibling, skips to the first actual DOM element, so:
    print p.get_text()
    print p.find_next_sibling().b.get_text()
If you don't want to rely on any of that, but can count on the fact that they're always one-to-one (that is, no chance of any stray p elements without a corresponding secondDiv), you can just zip the two searches together:
fdiv = soup.find(id='firstDiv')
for p, sdiv in zip(fdiv.find_all('p'), fdiv.find_all(id='secondDiv')):
    print p.get_text(), sdiv.b.get_text()
You could also iterate p.next_siblings to find the element you want:
from bs4 import Tag

for p in soup.find(id='firstDiv').find_all('p'):
    div = next(sib for sib in p.next_siblings
               if isinstance(sib, Tag) and sib.get('id') == 'secondDiv')
    print p.get_text(), div.b.get_text()
But ultimately, that's just a more verbose way of writing the first solution, so go back to that one. :)
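Putting the preferred version together with the MWE above (continuing from that soup object; a sketch that assumes every p really does have a following secondDiv), you can collect the pairs in one pass:
pairs = []
for p in soup.find(id="firstDiv").find_all("p"):
    date = p.find_next_sibling(id="secondDiv").b.get_text()
    pairs.append((p.get_text(), date))

# Each comment text is now stored next to its date
for text, date in pairs:
    print text, date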

Split html document in pieces parsing html comments with BeautifulSoup

This is a pretty small question that has almost been resolved in a previous question.
The problem is that right now I have an array of comments, but it is not quite what I need: I get an array of the comments' contents, and I need to get the HTML in between them.
Say I have something like:
<p>some html here<p>
<!-- begin mark -->
<p>Html i'm interested at.</p>
<p>More html i want to pull out of the document.</p>
<!-- end mark -->
<!-- begin mark -->
<p>This will be pulled later, but we will come to it when I get to pull the previous section.</p>
<!-- end mark -->
In a reply they point to the Crummy explanation of navigating the HTML tree, but I didn't find an answer to my problem there.
Any ideas? Thanks.
PS. Extra kudos if someone points me to an elegant way to repeat the process a few times in a document, as I could probably get it to work, but poorly :D
Edited to add:
With the information provided by Martijn Pieters, I managed to pass each comment from the array obtained with the above code to the generator function he designed. So this gives no error:
for elem in comments:
    htmlcode = allnext(elem)
    print htmlcode
I think now it will be possible to manipulate the htmlcode content before iterating through the array.
You can use the .next_sibling pointer to get to the next element. You can use that to find everything following a comment, up to but not including another comment:
from bs4 import Comment

def allnext(comment):
    curr = comment
    while True:
        curr = curr.next_sibling
        if isinstance(curr, Comment):
            return
        yield curr
This is a generator function; you use it to iterate over all 'next' elements:
for elem in allnext(comment):
    print elem
or you can use it to create a list of all next elements:
elems = list(allnext(comment))
Your example is a little too small for BeautifulSoup, and it'll wrap each comment in a <p> tag, but if we use a snippet from your original target www.gamespot.com this works just fine:
<div class="ad_wrap ad_wrap_dart"><div style="text-align:center;"><img alt="Advertisement" src="http://ads.com.com/Ads/common/advertisement.gif" style="display:block;height:10px;width:120px;margin:0 auto;"/></div>
<!-- start of gamespot gpt ad tag -->
<div id="div-gpt-ad-1359295192-lb-top">
<script type="text/javascript">
googletag.display('div-gpt-ad-1359295192-lb-top');
</script>
<noscript>
<a href="http://pubads.g.doubleclick.net/gampad/jump?iu=/6975/row/gamespot.com/home&sz=728x90|970x66|970x150|970x250|960x150&t=pos%3Dtop%26platform%3Ddesktop%26&c=1359295192">
<img src="http://pubads.g.doubleclick.net/gampad/ad?iu=/6975/row/gamespot.com/home&sz=728x90|970x66|970x150|970x250|960x150&t=pos%3Dtop%26platform%3Ddesktop%26&c=1359295192"/>
</a>
</noscript>
</div>
<!-- end of gamespot gpt tag -->
</div>
If comment is a reference to the first comment in that snippet, the allnext() generator gives me:
>>> list(allnext(comment))
[u'\n', <div id="div-gpt-ad-1359295192-lb-top">
<script type="text/javascript">
googletag.display('div-gpt-ad-1359295192-lb-top');
</script>
<noscript>
<a href="http://pubads.g.doubleclick.net/gampad/jump?iu=/6975/row/gamespot.com/home&sz=728x90|970x66|970x150|970x250|960x150&t=pos%3Dtop%26platform%3Ddesktop%26&c=1359295192">
<img src="http://pubads.g.doubleclick.net/gampad/ad?iu=/6975/row/gamespot.com/home&sz=728x90|970x66|970x150|970x250|960x150&t=pos%3Dtop%26platform%3Ddesktop%26&c=1359295192"/>
</a>
</noscript>
</div>, u'\n']
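To repeat the process for every marked section in the document (the PS in the question), one approach is to locate all the 'begin mark' comments and run the generator on each of them. This is a sketch of my own, reusing the allnext() generator above and assuming the page source is in a hypothetical page_html variable:
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(page_html)
begin_marks = soup.find_all(
    text=lambda node: isinstance(node, Comment) and 'begin mark' in node)

# One list of nodes per marked section, in document order
sections = [list(allnext(comment)) for comment in begin_marks]
for section in sections:
    print ''.join(unicode(elem) for elem in section)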

html5lib. How to get valid html without adding html, head and body tags?

I'm validating custom HTML from users with html5lib. The problem is that html5lib adds html, head and body tags, which I don't need.
import html5lib
from html5lib import treebuilders

parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("simpleTree"))
f = open('/home/user/ex.html')
doc = parser.parse(f)
doc.toxml()
'<html><head/><body><div>\n speedhunters.com\n</div>\n</body></html>'
This is validated and can be sanitized, but how can I remove these tags or prevent them from being added to the tree? I mean without resorting to string replace().
It seems that we can use the hidden property of Tags in order to prevent the tag itself from being 'exported' when casting a tag/soup to string/unicode:
>>> from bs4 import BeautifulSoup
>>> html = u"<div><footer><h3>foot</h3></footer></div><div>foo</div>"
>>> soup = BeautifulSoup(html, "html5lib")
>>> print soup.body.prettify()
<body>
<div>
<footer>
<h3>
foot
</h3>
</footer>
</div>
<div>
foo
</div>
</body>
Essentially, the questioner's goal is to get the entire content of the body tag without the <body> wrapper itself. This works:
>>> soup.body.hidden=True
>>> print soup.body.prettify()
<div>
<footer>
<h3>
foot
</h3>
</footer>
</div>
<div>
foo
</div>
I found this by going through the BeautifulSoup source. After calling soup = BeautifulSoup(html), the root tag has the internal name '[document]'. By default, only the root tag has hidden==True. This prevents its name from ending up in any HTML output.
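As a small usage sketch of that trick (my own wrapper, not from the answer, assuming html5lib is available as a bs4 backend), here is a helper that returns just the inner HTML of the body:
from bs4 import BeautifulSoup

def inner_body(html):
    # Parse with html5lib, then hide the <body> wrapper before serializing
    soup = BeautifulSoup(html, "html5lib")
    soup.body.hidden = True
    return soup.body.decode()

print inner_body(u"<div><footer><h3>foot</h3></footer></div><div>foo</div>")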
Wow, html5lib has horrible documentation.
Looking through the source, and working on a quick test case, this appears to work:
import html5lib
from html5lib import treebuilders

# Wrapped in a function so the return below is valid
def body_content(path):
    parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("simpleTree"))
    with open(path) as test:
        doc = parser.parse(test)
    for child in doc:
        if child.parent.name == "body":
            return child.toxml()
It's a bit hackish, but less so than a replace().
lxml may be a better choice if you're dealing with "uncommon" html.
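A quick sketch of what that could look like with lxml (my own example, not part of the answer): parse the document, then serialize only the children of body:
import lxml.html

doc = lxml.html.document_fromstring(open('/home/user/ex.html').read())
body = doc.body
# Rebuild the markup from body's leading text plus each child element (tail text included)
print (body.text or '') + ''.join(lxml.html.tostring(child) for child in body)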
