Equivalent to innerHTML when using lxml.html to parse HTML - Python

I'm working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml due to its speed.
I would like to know what the most sensible way in the library is to do the equivalent of JavaScript's innerHTML - that is, to retrieve or set the complete contents of a tag.
<body>
<h1>A title</h1>
<p>Some text</p>
</body>
The innerHTML is therefore:
<h1>A title</h1>
<p>Some text</p>
I can do it using hacks (converting to a string, regexes, etc.), but I'm assuming that there is a correct way to do this using the library which I am missing due to unfamiliarity. Thanks for any help.
EDIT: Thanks to pobk for showing me the way on this so quickly and effectively. For anyone trying the same, here is what I ended up with:
from lxml import html
from cStringIO import StringIO
t = html.parse(StringIO(
"""<body>
<h1>A title</h1>
<p>Some text</p>
Untagged text
<p>
Unclosed p tag
</body>"""))
root = t.getroot()
body = root.body
print (body.text or '') + ''.join([html.tostring(child) for child in body.iterdescendants()])
Note that the lxml.html parser will fix up the unclosed tag, so beware if this is a problem.

Sorry for bringing this up again, but I've been looking for a solution and yours contains a bug:
<body>This text is ignored
<h1>Title</h1><p>Some text</p></body>
Text directly under the root element is ignored. I ended up doing this:
(body.text or '') +\
''.join([html.tostring(child) for child in body.iterchildren()])
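For what it's worth, html.tostring() serializes an element together with its tail text, so iterating over the direct children (rather than all descendants) both keeps the text between tags and avoids emitting nested elements twice. A minimal Python 3 sketch of that approach (the helper name is my own):

from lxml import html

def inner_html(el):
    # tostring() includes each child's tail text, so direct children
    # suffice; descendants would be serialized twice.
    return (el.text or '') + ''.join(
        html.tostring(child, encoding=str) for child in el.iterchildren())

doc = html.document_fromstring(
    "<body>This text is ignored\n<h1>Title</h1><p>Some text</p></body>")
print(inner_html(doc.body))
# This text is ignored
# <h1>Title</h1><p>Some text</p>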

You can get the children of an ElementTree node using the getchildren() or iterdescendants() methods of the root node:
>>> from lxml import etree
>>> from cStringIO import StringIO
>>> t = etree.parse(StringIO("""<body>
... <h1>A title</h1>
... <p>Some text</p>
... </body>"""))
>>> root = t.getroot()
>>> for child in root.iterdescendants():
... print etree.tostring(child)
...
<h1>A title</h1>
<p>Some text</p>
This can be shorthanded as follows:
print ''.join([etree.tostring(child) for child in root.iterdescendants()])

import lxml.etree as ET

body = t.xpath("//body")
for tag in body:
    h = html.fromstring(ET.tostring(tag[0])).xpath("//h1")
    p = html.fromstring(ET.tostring(tag[1])).xpath("//p")
    htext = h[0].text_content()
    ptext = p[0].text_content()
You can also use .get('href') on a tag and .attrib for its attribute dictionary.
The child indices here are hardcoded, but you can also determine them dynamically.
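For example, a minimal sketch of attribute access (the document and URL here are made up):

from lxml import html

doc = html.fromstring('<p><a href="https://example.com" class="external">link</a></p>')
a = doc.xpath('//a')[0]
print(a.get('href'))   # https://example.com
print(dict(a.attrib))  # {'href': 'https://example.com', 'class': 'external'}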

Here is a Python 3 version:
from xml.sax import saxutils
from lxml import html
def inner_html(tree):
    """Return the inner HTML of an lxml element."""
    return (saxutils.escape(tree.text) if tree.text else '') + \
        ''.join([html.tostring(child, encoding=str) for child in tree.iterchildren()])
Note that this includes escaping of the initial text as recommended by andreymal -- this is needed to avoid tag injection if you're working with sanitized HTML!
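A quick usage sketch, reusing the inner_html helper above; note how text that merely looks like markup stays escaped:

from lxml import html

tree = html.fromstring(
    '<div>&lt;script&gt;alert(1)&lt;/script&gt; hello <b>world</b></div>')
print(inner_html(tree))
# &lt;script&gt;alert(1)&lt;/script&gt; hello <b>world</b>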

I find none of the answers satisfying; some are even in Python 2. So I'm adding a one-liner solution that produces innerHTML-like output and works with Python 3:
from lxml import etree, html
# generate some HTML element node
node = html.fromstring("""<container>
Some random text <b>bold <i>italic</i> yeah</b> no yeah
<!-- comment blah blah --> <img src='gaga.png' />
</container>""")
# compute inner HTML of element
innerHTML = "".join([
str(c) if type(c)==etree._ElementUnicodeResult
else html.tostring(c, with_tail=False).decode()
for c in node.xpath("node()")
]).strip()
The result will be:
'Some random text <b>bold <i>italic</i> yeah</b> no yeah\n<!-- comment blah blah --> <img src="gaga.png">'
What it does: The xpath delivers all node children (text, elements, comments). The list comprehension produces a list of the text contents of the text nodes and HTML content of element nodes. Those are then joined into a single string. If you want to get rid of comments, use *|text() instead of node() for xpath.
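For instance, a variant of the same comprehension with *|text() (reusing node from above); the comment node is simply never selected:

# *|text() selects element and text children, but not comments
inner_no_comments = "".join(
    str(c) if isinstance(c, etree._ElementUnicodeResult)
    else html.tostring(c, with_tail=False).decode()
    for c in node.xpath("*|text()")
).strip()
print(inner_no_comments)
# roughly: 'Some random text <b>bold <i>italic</i> yeah</b> no yeah\n <img src="gaga.png">'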

Related

How do I use lxml and Python to traverse the <body> of an HTML document along with its children

I would like to take an HTML document and traverse the <body> part of the document with its children. I see lots of examples of getting a subtree via XPath or tag name, but this doesn't seem to give the children.
import lxml
from lxml import html, etree
html3 = "<html><head><title>test<body><h1>page title</h3><p>some text</p>"
root = lxml.html.fromstring(html3)
tree = etree.ElementTree(root)
for el in root.iter():
    # do something
    print(el.text, tree.getpath(el))
This will output
None /html
None /html/head
test /html/head/title
None /html/body
page title /html/body/h1
some text /html/body/p
I would like only
page title /html/body/h1
some text /html/body/p
Any help gratefully received.
I had a similar difficulty, then I figured out that each etree element has an iterator with which you can traverse its subtree.
For instance, root here will give you the body, and using that you can iterate over each element of the body:
from lxml import etree
parser = etree.HTMLParser()
tree = etree.parse('yourdocument.html', parser)
root = tree.xpath('/html/body')[0]
for i in root.getiterator():
    print(i.tag, i.text)
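If you only want the elements under <body>, one option (a sketch, reusing the html3 string from the question above) is to start the iteration at the body element itself:

from lxml import etree, html

html3 = "<html><head><title>test<body><h1>page title</h3><p>some text</p>"
root = html.fromstring(html3)
tree = etree.ElementTree(root)

# Iterate only the descendants of <body>, skipping <html> and <head>:
for el in root.body.iterdescendants():
    print(el.text, tree.getpath(el))
# page title /html/body/h1
# some text /html/body/p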
It seems that your HTML has an invalid format. I wrote a little program with BeautifulSoup that you can perhaps adapt for your purpose:
from bs4 import BeautifulSoup
html3 = "<html><head><title>test</title></head><body><h1>page title</h1><p>some text</p><body></html>"
soup = BeautifulSoup(html3, "html5lib")
body = soup.find('body')
for item in body.findChildren():
    print(item)
Output
<h1>page title</h1>
<p>some text</p>

lxml etree HTML parser changes order of nodes (<center> inside <p>)

I'm currently facing an issue where I can't explain the etree behaviour. The following code demonstrates the issue I am facing: I want to parse an HTML string as illustrated below, change the attribute of an element, and print the HTML again when done.
from lxml import etree
from io import StringIO, BytesIO
string = "<p><center><code>git clone https://github.com/AlexeyAB/darknet.git</code></center></p>"
parser = etree.HTMLParser()
test = etree.fromstring(string, parser)
print(etree.tostring(test, pretty_print=True, method="html"))
I get this output:
<html><body>
<p></p>
<center><code>git clone https://github.com/AlexeyAB/darknet.git</code></center>
</body></html>
As you can see (let's ignore the <html> and <body> tags etree adds), the order of the nodes has been changed! The <p> tag that used to wrap the <center> tag, now loses its content, and that content gets added after the </p> tag closes. Eh?
When I omit the <center> tag, all of a sudden the parsing is done right:
from lxml import etree
from io import StringIO, BytesIO
string = "<p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p>"
parser = etree.HTMLParser()
test = etree.fromstring(string, parser)
print(etree.tostring(test, pretty_print=True, method="html"))
With correct output:
<html><body><p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p></body></html>
Am I doing something wrong here? I have to use the HTML parser because I get a lot of parsing errors when not using it. I also can't change the order of the <p> and <center> tags, as I read them this way.
<center> is a block level element.
<p> cannot legally contain block level elements.
Therefore the parser closes the <p> when it encounters <center>.
Use valid HTML - or an XML parser, which does not care about HTML rules (but in exchange can't deal with some HTML specifics, like most named entities such as &nbsp;, or unclosed/self-closing tags).
Centering content has been done with CSS for ages now; there is no reason to use <center> anymore (and, in fact, it's deprecated). But it still works, and if you insist on using it, switch the nesting:
<center><p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p></center>
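A quick round trip with the same parser suggests the swapped nesting survives intact:

from lxml import etree

string = "<center><p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p></center>"
parser = etree.HTMLParser()
test = etree.fromstring(string, parser)
print(etree.tostring(test, method="html").decode())
# the <p> keeps its <code> content this time, wrapped in <html><body>...</body></html>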

BeautifulSoup replace_with for non-standard tags

I'm trying to write a parser that will take HTML and convert/output to Wiki syntax (<b> = ''', <i> = '', etc).
So far, BeautifulSoup seems only capable of replacing the contents within a tag, so <b> becomes <'''> instead of '''. I can use a re.sub() to swap these out, but since BS turns the document into a 'complex tree of Python objects', I can't figure out how to swap out these tags and re-insert them into the overall document.
Does anyone have ideas?
I am pretty sure there are already tools that would do that for you, but if you are asking about how to do that with BeautifulSoup, you can use replace_with(), but you would need to preserve the text of the element. Naive and simple example:
from bs4 import BeautifulSoup
data = """
<div>
<b>test1</b>
<i>test2</i>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for b in soup.find_all("b"):
b.replace_with("'''%s'''" % b.text)
for i in soup.find_all("i"):
i.replace_with("''%s''" % i.text)
print(soup.prettify())
Prints:
<div>
'''test1'''
''test2''
</div>
To also handle nested tags, e.g. "<div><b>bold with some <i>italics</i></b></div>" you have to be a bit more careful.
I put together the following implementation when I needed to do something similar:
from bs4 import BeautifulSoup
def wikify_tag(tag, replacement):
    tag.insert(0, replacement)
    tag.append(replacement)
    tag.unwrap()
data = """
<div>
<b>test1</b>
<i>test2</i>
<b>bold with some <i>italics</i></b>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for b in soup.find_all("b"):
wikify_tag(b, "'''")
for i in soup.find_all("i"):
wikify_tag(i, "''")
print(soup)
Prints (note that .prettify() makes it look uglier):
<div>
'''test1'''
''test2''
'''bold with some ''italics'''''
</div>
If you also want to replace tags with wiki-templates you can extend wikify_tag to take a start and an end string.
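For example, a sketch of that extension (the function and template names here are invented):

from bs4 import BeautifulSoup

def wikify_tag_between(tag, start, end):
    # Same trick as wikify_tag, but with distinct start and end strings.
    tag.insert(0, start)
    tag.append(end)
    tag.unwrap()

soup = BeautifulSoup("<div><blockquote>a <b>quote</b></blockquote></div>", "html.parser")
for q in soup.find_all("blockquote"):
    wikify_tag_between(q, "{{quote|", "}}")
print(soup)
# <div>{{quote|a <b>quote</b>}}</div>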

Find text using lxml etree

I'm trying to get a text from one tag using lxml etree.
<div class="litem__type">
<div>
Robbp
</div>
<div>Estimation</div>
+487 (0)639 14485653
•
<a href="mailto:herbrich#gmail.com">
Email Address
</a>
•
<a class="external" href="http://www.google.com">
Homepage
</a>
</div>
The problem is that I can't locate it reliably, because there are many differences between snippets of this kind. In some cases the first and second div aren't there at all. As you can see, the telephone number is not in its own div.
I suppose it would be possible to extract the telephone number using BeautifulSoup's contents, but I'm trying to use the lxml module's XPath.
Do you have any ideas? (The email doesn't always have to be there, either.)
EDIT: The best idea is probably to use a regex, but I don't know how to tell it to extract just the text between two <div></div> tags.
You should avoid using regexes to parse XML/HTML wherever possible, because they are not as reliable as using element trees.
The text after element A's closing tag, but before element B's opening tag, is called element A's tail text. To select this tail text using lxml etree you could do the following:
content = '''
<div class="litem__type">
<div>Robbp</div>
<div>Estimation</div>
+487 (0)639 14485653
Email Address
<a class="external" href="http://www.google.com">Homepage</a>
</div>'''
from lxml import etree
tree = etree.XML(content)
phone_number = tree.xpath('div[2]')[0].tail.strip()
print(phone_number)
Output
'+487 (0)639 14485653'
The strip() function is used here to remove whitespace on either side of the tail text.
You can iterate and get the text after each div tag.
from lxml import etree
tree = etree.parse("filename.xml")
items = tree.xpath('//div')
for node in items:
    # you can check here if it is a phone number
    print(node.tail)

lxml equivalent to BeautifulSoup "OR" syntax?

I'm converting some HTML parsing code from BeautifulSoup to lxml. I'm trying to figure out the lxml equivalent syntax for the following BeautifulSoup statement:
soup.find('a', {'class': ['current zzt', 'zzt']})
Basically I want to find all of the "a" tags in the document that have a class attribute of either "current zzt" or "zzt". BeautifulSoup allows one to pass in a list, dictionary, or even a regular expression to perform the match.
What is the lxml equivalent?
Thanks!
No, lxml does not provide the "find first or return None" method you're looking for. Just use (select(soup) or [None])[0] if you need that, or write a function to do it for you.
#!/usr/bin/python
import lxml.html
import lxml.cssselect
soup = lxml.html.fromstring("""
<html>
<a href="foo" class="yyy zzz" />
<a href="bar" class="yyy" />
<a href="baz" class="zzz" />
<a href="quux" class="zzz yyy" />
<a href="warble" class="qqq" />
<p class="yyy zzz">Hello</p>
</html>""")
select = lxml.cssselect.CSSSelector("a.yyy.zzz, a.yyy")
print [lxml.html.tostring(s).strip() for s in select(soup)]
print (select(soup) or [None])[0]
OK, so soup.find('a') would indeed find the first a element or None, as you expect. Trouble is, it doesn't appear to support the rich XPath syntax needed for CSSSelector.
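For the OR match from the question itself, a plain XPath predicate also works; note that @class compares the whole attribute string (unlike CSS class matching), so a class of "zzt extra" would not match:

from lxml import html

doc = html.fromstring(
    '<div><a href="foo" class="current zzt">x</a>'
    '<a href="bar" class="zzt">y</a>'
    '<a href="baz" class="zzt extra">z</a></div>')
links = doc.xpath("//a[@class='current zzt' or @class='zzt']")
print([a.get("href") for a in links])  # ['foo', 'bar']
first = links[0] if links else None    # "find first or None" in one line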
