How to modify XML as text in lxml - python

I have an XML file generated by an IDE; however, it unfortunately outputs code with newlines as BRs and seems to randomly decide where to place newlines. Example:
if test = true
foo;
bar;
endif
becomes the following XTML within an XML file:
<body>
<p>if test = true<br /> foo;<br /> bar;<br />endif
</p>
</body>
I am trying to make a pre-processor for these files in python using lxml to make it easier to version control them. However, I cannot figure out to modify the XML as text so that I can place each BR on it's own line like the following:
<body>
<p>if test = true
<br /> foo;
<br /> bar;
<br />endif
</p>
</body>
How does one edit the xml as text, or failing that, is there another way to get the results like above?

One option would be to add a new-line character to the p tag's text and br tag tails. Example:
from lxml import html
data = """
<html>
<body>
<p>if test = true<br /> foo;<br /> bar;<br />endif
</p>
</body>
</html>
"""
tree = html.fromstring(data)
p = tree.find('.//p')
p.text += '\n'
for element in tree.xpath('.//br'):
element.tail += '\n'
print html.tostring(tree)
Prints:
<html>
<body>
<p>if test = true
<br> foo;
<br> bar;
<br>endif
</p>
</body>
</html>

Related

How to extract text and the xpath to that element of the HTML page in Python

I am working on a Django project where I need to extract all the text-containing elements and the xPath to that element.
E.G:
<html>
<head>
<title>
The Demo page
</title>
</head>
<body>
<div>
<section>
<h1> Hello world
</h1>
</section>
<div>
<p>
Hope you all are doing well,
</p>
</div>
<div>
<p>
This is the example HTML
</p>
</div>
</div>
</body>
</html>
The output should be something like:
/head/title: The Demo Page
/body/div/section/h1: Hello world!
/body/div/div[1]/p: Hope you all are doing well,
/body/div/div[2]/p: This is the example HTML
Something like this should work:
from lxml import etree
html = """[your html above]"""
root = etree.fromstring(html)
targets = root.xpath('//text()[normalize-space()]/..')
tree = etree.ElementTree(root)
for target in targets:
print(tree.getpath(target),target.text.strip())
Output:
/html/head/title The Demo page
/html/body/div/section/h1 Hello world
/html/body/div/div[1]/p Hope you all are doing well,
/html/body/div/div[2]/p This is the example HTML

Add missing paragraph tags to HTML

I'm processing some medium fancy HTML pages to convert to simpler XHTML ones. The source pages have several divs (that I'm removing), that contain text not inside <p> tags. I need to add these <p> tags.
Here is a minimal example of the source page
<!DOCTYPE html>
<html>
<body>
<p>Hello world!</p>
<div style="font-weight: bold;">
This is a sample page
<br/>
Lots of things to learn!
<p>And lots to test</p>
</div>
<p>Enough with the sample code</p>
</body>
</html>
I want to convert it to
<!DOCTYPE html>
<html>
<body>
<p>Hello world!</p>
<p>This is a sample page</p>
<p>Lots of things to learn!</p>
<p>And lots to test</p>
<p>Enough with the sample code</p>
</body>
</html>
I'm developing a python script using BeautifulSoup4 to do all the stuff. Now I'm stuck at this step. And it looks more like a regex job to locate the text to embed in <p> tags, and pass it to BeautifulSoup4. What do you think is the best approach to the problem?
I've scan several pages and I've seen these wild texts at the start of divs, but I can't exclude there will be several more around the pages in random places. (i.e. a script that checks at start of divs won't probably be reliable).
Notice the <br/> tags that has to be used to split the <p> paragraphs.
This script will remove all tags from <body> but <p> and then creates new paragraphs in place of <br/>:
from bs4 import BeautifulSoup
txt = '''<!DOCTYPE html>
<html>
<body>
<p>Hello world!</p>
<div style="font-weight: bold;">
This is a sample page
<br/>
Lots of things to learn!
<p>And lots to test</p>
</div>
<p>Enough with the sample code</p>
</body>
</html>'''
soup = BeautifulSoup(txt, 'html.parser')
for tag in soup.body.find_all(lambda tag: tag.name != 'p'):
tag.unwrap()
for txt in soup.body.find_all(text=True):
if txt.find_parent('p') or txt.strip() == '':
continue
txt.wrap(soup.new_tag("p"))
print(soup.prettify())
Prints:
<!DOCTYPE html>
<html>
<body>
<p>
Hello world!
</p>
<p>
This is a sample page
</p>
<p>
Lots of things to learn!
</p>
<p>
And lots to test
</p>
<p>
Enough with the sample code
</p>
</body>
</html>

How can I capture HTML, unmolested by the capturing library?

Is there a Python library out there that will let me get an arbitrary HTML snippet without molesting the markup? As far as I can tell, lxml, BeautifulSoup, and pyquery all make it easy to something like soup.find(".arbitrary-class"), but the HTML it returns is formatted. I want the raw, original markup.
So for example, say I have this:
<html>
<head>
<title>test</title>
</head>
<body>
<div class="arbitrary-class">
This is some<br />
markup with <br>
<p>some potentially problematic</p>
stuff in it <input type="text" name="w00t">
</div>
</body>
</html>
I want to capture exactly:
"
This is some<br />
markup with <br>
<p>some potentially problematic</p>
stuff in it <input type="text" name="w00t">
"
...spaces and all, and without mangling the tags to be properly formatted (as <br /> for example).
The trouble, it seems is that all 3 libraries appear to construct the DOM internally and simply return a Python object representing what the file should be rather than what it is, so I don't know where/how to get the original code snippet I need.
This code:
from bs4 import BeautifulSoup
with open("index.html") as fp:
soup = BeautifulSoup(fp, "html.parser")
print soup.select(".arbitrary-class")[0].contents
will return you the list:
[u'\n This is some', <br/>, u'\n markup with ', <br/>, u'\n', <p>some potentially problematic</p>, u'\n stuff in it ', <input name="w00t" type="text"/>, u'\n']
EDIT:
As Daniel noted in the comments, this results in normalized tags.
The only alternative I can find is to use a parser generator, such as pyparsing. The code below is a slight modification to some of their example code for the withAttribute function.
from pyparsing import *
html = """<html>
<head>
<title>test</title>
</head>
<body>
<div class="arbitrary-class">
This is some<br />
markup with <br>
<p>some potentially problematic</p>
stuff in it <input type="text" name="w00t">
</div>
</body>
</html>"""
div,div_end = makeHTMLTags("div")
# only match div tag having a class attribute with value "arbitrary-class"
div_grid = div().setParseAction(withClass("arbitrary-class"))
grid_expr = div_grid + SkipTo(div | div_end)("body")
for grid_header in grid_expr.searchString(html):
print repr(grid_header.body)
The output from this code is as follows:
'\n This is some<br />\n markup with <br>\n <p>some potentially problematic</p>\n stuff in it <input type="text" name="w00t">'
Note that the first <br/> now has a space, and the <input> tag no longer has an added / before the closing >. The only difference from your specification is that the trailing white space is missing. You might be able to resolve this difference by refining this solution.

Get a structure of HTML code

I'm using BeautifulSoup4 and I'm curious whether is there a function which returns a structure (ordered tags) of the HTML code.
Here is an example:
<html>
<body>
<h1>Simple example</h1>
<p>This is a simple example of html page</p>
</body>
</html>
print page.structure():
>>
<html>
<body>
<h1></h1>
<p></p>
</body>
</html>
I tried to find a solution but no success.
Thanks
There is not, to my knowledge, but a little recursion should work:
def taggify(soup):
for tag in soup:
if isinstance(tag, bs4.Tag):
yield '<{}>{}</{}>'.format(tag.name,''.join(taggify(tag)),tag.name)
demo:
html = '''<html>
<body>
<h1>Simple example</h1>
<p>This is a simple example of html page</p>
</body>
</html>'''
soup = BeautifulSoup(html)
''.join(taggify(soup))
Out[34]: '<html><body><h1></h1><p></p></body></html>'
Simple python regular expressions can do what you want:
import re
html = '''<html>
<body>
<h1>Simple example</h1>
<p>This is a simple example of html page</p>
</body>
</html>'''
structure = ''.join(re.findall(r'(</?.+?>|/n+?)', html))
This methods preserves newline characters.

Editing tree in place while iterating in lxml

I am using lxml to parse html and edit it to produce a new document. Essentially, I'm trying to use it somewhat like the javascript DOM - I know this is not really the intended use, but much of it works well so far.
Currently, I use iterdescendants() to get a iterable list of elements and then deal with each in turn.
However, if an element is dropped during the iteration, its children are still considered, since the dropping does not affect the iteration, as you would expect. In order to get the results I want, this hack works:
from lxml.html import fromstring, tostring
import urllib2
import re
html = '''
<html>
<head>
</head>
<body>
<div>
<p class="unwanted">This content should go</p>
<p class="fine">This content should stay</p>
</div>
<div id = "second" class="unwanted">
<p class = "alreadydead">This content should not be looked at</p>
<p class = "alreadydead">Nor should this</>
<div class="alreadydead">
<p class="alreadydead">Still dead</p>
</div>
</div>
<div>
<p class="yeswanted">This content should also stay</p>
</div>
</body>
for element in allElements:
s = "%s%s" % (element.get('class', ''), element.get('id', ''))
if re.compile('unwanted').search(s):
for i in range(len(element.findall('.//*'))):
allElements.next()
element.drop_tree()
print tostring(page.body)
This outputs:
<body>
<div>
<p class="yeswanted">This content should stay</p>
</div>
<div>
<p class="yeswanted">This content should also stay</p>
</div>
</body>
This feels like a nasty hack - is there a more sensible way to achieve this using the library?
To simplify things you can use lxml's support for regular expressions within an XPath to find and kill the unwanted nodes without needing to iterate over all descendants.
This produces the same result as your script:
EXSLT_NS = 'http://exslt.org/regular-expressions'
XPATH = r"//*[re:test(#class, '\bunwanted\b') or re:test(#id, '\bunwanted\b')]"
tree = lxml.html.fromstring(html)
for node in tree.xpath(XPATH, namespaces={'re': EXSLT_NS}):
node.drop_tree()
print lxml.html.tostring(tree.body)

Categories