How can I capture HTML, unmolested by the capturing library?

Is there a Python library out there that will let me get an arbitrary HTML snippet without molesting the markup? As far as I can tell, lxml, BeautifulSoup, and pyquery all make it easy to do something like soup.find(".arbitrary-class"), but the HTML they return is reformatted. I want the raw, original markup.
So for example, say I have this:
<html>
<head>
<title>test</title>
</head>
<body>
<div class="arbitrary-class">
This is some<br />
markup with <br>
<p>some potentially problematic</p>
stuff in it <input type="text" name="w00t">
</div>
</body>
</html>
I want to capture exactly:
"
This is some<br />
markup with <br>
<p>some potentially problematic</p>
stuff in it <input type="text" name="w00t">
"
...spaces and all, and without mangling the tags to be properly formatted (as <br /> for example).
The trouble, it seems, is that all three libraries construct a DOM internally and return a Python object representing what the file should be rather than what it is, so I don't know where or how to get the original snippet I need.

This code:
from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")
print(soup.select(".arbitrary-class")[0].contents)
will return you the list:
[u'\n This is some', <br/>, u'\n markup with ', <br/>, u'\n', <p>some potentially problematic</p>, u'\n stuff in it ', <input name="w00t" type="text"/>, u'\n']
EDIT:
As Daniel noted in the comments, this results in normalized tags.
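You can see the normalization directly. A minimal demonstration (a hypothetical one-liner; the exact output shape is just BeautifulSoup's default serialization):
from bs4 import BeautifulSoup
# html.parser accepts the bare <br> and the unclosed <input>, but the
# serializer writes void tags back out in XHTML style
print(BeautifulSoup('This is some<br> markup <input type="text" name="w00t">', "html.parser"))
# -> This is some<br/> markup <input name="w00t" type="text"/>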
The only alternative I can find is to use a parser generator, such as pyparsing. The code below is a slight modification of some of their example code for the withAttribute function.
from pyparsing import *
html = """<html>
<head>
<title>test</title>
</head>
<body>
<div class="arbitrary-class">
This is some<br />
markup with <br>
<p>some potentially problematic</p>
stuff in it <input type="text" name="w00t">
</div>
</body>
</html>"""
div, div_end = makeHTMLTags("div")

# only match a div tag having a class attribute with value "arbitrary-class"
div_grid = div().setParseAction(withClass("arbitrary-class"))
grid_expr = div_grid + SkipTo(div | div_end)("body")

for grid_header in grid_expr.searchString(html):
    print(repr(grid_header.body))
The output from this code is as follows:
'\n This is some<br />\n markup with <br>\n <p>some potentially problematic</p>\n stuff in it <input type="text" name="w00t">'
Note that the first <br /> keeps its space, and the <input> tag no longer has a / added before the closing >. The only difference from your specification is that the trailing whitespace is missing; you might be able to resolve that by refining this solution.
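For example, one possible refinement (an untested sketch, and only valid when the matched div contains no nested divs) is to skip to the closing tag alone, so the trailing whitespace before </div> is captured too:
grid_expr = div_grid + SkipTo(div_end)("body")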

Related

Add missing paragraph tags to HTML

I'm processing some medium-fancy HTML pages to convert them to simpler XHTML ones. The source pages have several divs (which I'm removing) that contain text not inside <p> tags. I need to add these <p> tags.
Here is a minimal example of the source page
<!DOCTYPE html>
<html>
<body>
<p>Hello world!</p>
<div style="font-weight: bold;">
This is a sample page
<br/>
Lots of things to learn!
<p>And lots to test</p>
</div>
<p>Enough with the sample code</p>
</body>
</html>
I want to convert it to
<!DOCTYPE html>
<html>
<body>
<p>Hello world!</p>
<p>This is a sample page</p>
<p>Lots of things to learn!</p>
<p>And lots to test</p>
<p>Enough with the sample code</p>
</body>
</html>
I'm developing a Python script using BeautifulSoup4 to do all the work. Now I'm stuck at this step, and it looks more like a regex job to locate the text to wrap in <p> tags and pass it to BeautifulSoup4. What do you think is the best approach to the problem?
I've scanned several pages and I've seen these wild texts at the start of divs, but I can't rule out that there will be more in random places around the pages (i.e. a script that only checks the start of divs probably won't be reliable).
Notice the <br/> tag, which has to be used to split the <p> paragraphs.
This script will remove all tags inside <body> except <p>, and then create new paragraphs in place of each <br/> (a sketch for saving the result follows the output below):
from bs4 import BeautifulSoup
txt = '''<!DOCTYPE html>
<html>
<body>
<p>Hello world!</p>
<div style="font-weight: bold;">
This is a sample page
<br/>
Lots of things to learn!
<p>And lots to test</p>
</div>
<p>Enough with the sample code</p>
</body>
</html>'''
soup = BeautifulSoup(txt, 'html.parser')
for tag in soup.body.find_all(lambda tag: tag.name != 'p'):
    tag.unwrap()

for txt in soup.body.find_all(text=True):
    if txt.find_parent('p') or txt.strip() == '':
        continue
    txt.wrap(soup.new_tag("p"))
print(soup.prettify())
Prints:
<!DOCTYPE html>
<html>
<body>
<p>
Hello world!
</p>
<p>
This is a sample page
</p>
<p>
Lots of things to learn!
</p>
<p>
And lots to test
</p>
<p>
Enough with the sample code
</p>
</body>
</html>
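If you want to save the converted page rather than just print it, a minimal sketch (output.html is a hypothetical target file):
# str(soup) serializes the tree without prettify()'s extra indentation
with open("output.html", "w") as out:
    out.write(str(soup))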

BeautifulSoup (bs4) parsing wrong

Parsing this sample document with bs4, from python 2.7.6:
<html>
<body>
<p>HTML allows omitting P end-tags.
<p>Like that and this.
<p>And this, too.
<p>What happened?</p>
<p>And can we <p>nest a paragraph, too?</p></p>
</body>
</html>
Using:
from bs4 import BeautifulSoup as BS
...
tree = BS(fh)
HTML has, for ages, allowed omitted end-tags for various element types, including P (check the schema, or a parser). However, bs4's prettify() on this document shows that it doesn't end any of those paragraphs until it sees </body>:
<html>
<body>
<p>
HTML allows omitting P end-tags.
<p>
Like that and this.
<p>
And this, too.
<p>
What happened?
</p>
<p>
And can we
<p>
nest a paragraph, too?
</p>
</p>
</p>
</p>
</p>
</body>
It's not prettify()'s fault, because traversing the tree manually I get the same structure:
<[document]>
<html>
␊
<body>
␊
<p>
HTML allows omitting P end-tags.␊␊
<p>
Like that and this.␊␊
<p>
And this, too.␊␊
<p>
What happened?
</p>
␊
<p>
And can we
<p>
nest a paragraph, too?
</p>
</p>
␊
</p>
</p>
</p>
</body>
␊
</html>
␊
</[document]>
Now, this would be the right result for XML (at least up to </body>, at which point it should report a well-formedness error). But this ain't XML. What gives?
The doc at http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser tells how to get BS4 to use different parsers. Apparently the default is html.parser, which the BS4 doc says is broken before Python 2.7.3, but which apparently still has the problem described above in 2.7.6.
Switching to "lxml" was unsuccessful for me, but switching to "html5lib" produces the correct result:
tree = BS(htmSource, "html5lib")
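A quick way to confirm the behaviour (a minimal sketch; html5lib must be installed first, e.g. with pip install html5lib):
from bs4 import BeautifulSoup
# html5lib closes the first <p> when the second one opens, as a browser would
tree = BeautifulSoup("<p>one<p>two", "html5lib")
print(tree.find_all("p"))  # [<p>one</p>, <p>two</p>]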

Parse HTML string from a file and remove element using xpath and write it to same file in python

For my project, I have to remove selected content from an HTML file using Python and XPath.
The selected element can be removed with the .remove() method; however, the content of the file on disk stays the same.
How do I write the modified content back to that file? (A sketch follows the output below.)
If I write the same tree object to the file using open().write(etree.tostring(tree_obj)), will the content differ for Unicode pages? Is there any other way to keep the modified file?
Why do the head tags in the output below have a different value after printing the tree object?
Please suggest.
Below is my code sample.
Example: I need to remove all div tags inside the html page.
HTML file:
<html>
<head>test</head>
<body>
<p>welcome to the world</p>
<div id="one">
<p>one</p>
<div id="one1">one1</div>
<ul>
<li>ones</li>
<li>twos</li>
<li>threes</li>
</ul>
</div>
<div id="hell">
<p>heaven</p>
<div id="one1">one1</div>
<ul>
<li>ones</li>
<li>twos</li>
<li>threes</li>
</ul>
</div>
<input type="text" placeholder="enter something.." />
<input type="button" value="click" />
</body>
</html>
Python file:
# _*_ coding:utf-8 _*_
import os
import sys
import traceback
import datetime
from lxml import etree, html
import shutil
def process():
    fd = open("D:\\hello.html")
    tree = html.fromstring(fd.read())
    remove_tag = '//div'
    for element in tree.xpath(remove_tag):
        element.getparent().remove(element)
    print(etree.tostring(tree))

process()
OUTPUT:
<html>
<head/><body><p>test
</p>
<p>welcome to the world</p>
<input type="text" placeholder="enter something.."/>
<input type="button" value="click"/>
</body></html>
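As for writing the modified tree back to the same file, a minimal sketch (assuming UTF-8 output is acceptable):
# etree.tostring(tree, encoding="utf-8") returns bytes, so open in binary mode
with open("D:\\hello.html", "wb") as out:
    out.write(etree.tostring(tree, encoding="utf-8"))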
I haven't worked in Python, but I have parsed HTML-based websites in Java with the help of the jsoup library.
Python has a similar library: Beautiful Soup. You can play with it to get the desired output.
Hope it helps.
Have you tried using python's standard library re?
import re
re.sub('<.*?>','', '<nb>foobar<aon><mn>')
re.sub('</.*?>','', '</nb>foobar<aon><mn>')
The above two operations can be used in combination to remove all the HTML tags, and they can easily be modified to remove only the div tags.

How to modify XML as text in lxml

I have an XML file generated by an IDE; however, it unfortunately outputs code with newlines as BRs and seems to randomly decide where to place newlines. Example:
if test = true
foo;
bar;
endif
becomes the following XHTML within an XML file:
<body>
<p>if test = true<br /> foo;<br /> bar;<br />endif
</p>
</body>
I am trying to make a pre-processor for these files in Python, using lxml, to make it easier to version control them. However, I cannot figure out how to modify the XML as text so that I can place each BR on its own line, like the following:
<body>
<p>if test = true
<br /> foo;
<br /> bar;
<br />endif
</p>
</body>
How does one edit the XML as text, or, failing that, is there another way to get results like the above?
One option would be to add a newline character to the p tag's text and to the br tags' tails. Example:
from lxml import html
data = """
<html>
<body>
<p>if test = true<br /> foo;<br /> bar;<br />endif
</p>
</body>
</html>
"""
tree = html.fromstring(data)
p = tree.find('.//p')
p.text += '\n'
for element in tree.xpath('.//br'):
    element.tail += '\n'
print(html.tostring(tree))
Prints:
<html>
<body>
<p>if test = true
<br> foo;
<br> bar;
<br>endif
</p>
</body>
</html>
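One caveat: if a <br/> has nothing after it, its tail is None and += raises a TypeError. A slightly more defensive variant of the loop:
for element in tree.xpath('.//br'):
    # treat a missing tail as an empty string before appending
    element.tail = (element.tail or '') + '\n'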

Editing tree in place while iterating in lxml

I am using lxml to parse HTML and edit it to produce a new document. Essentially, I'm trying to use it somewhat like the JavaScript DOM - I know this is not really the intended use, but much of it works well so far.
Currently, I use iterdescendants() to get an iterable of elements and then deal with each in turn.
However, if an element is dropped during the iteration, its children are still considered, since the drop does not affect the iteration, as you would expect. To get the results I want, this hack works:
from lxml.html import fromstring, tostring
import re
html = '''
<html>
<head>
</head>
<body>
<div>
<p class="unwanted">This content should go</p>
<p class="fine">This content should stay</p>
</div>
<div id = "second" class="unwanted">
<p class = "alreadydead">This content should not be looked at</p>
<p class = "alreadydead">Nor should this</>
<div class="alreadydead">
<p class="alreadydead">Still dead</p>
</div>
</div>
<div>
<p class="yeswanted">This content should also stay</p>
</div>
</body>
</html>
'''

page = fromstring(html)
allElements = page.iterdescendants()

for element in allElements:
    s = "%s%s" % (element.get('class', ''), element.get('id', ''))
    if re.compile('unwanted').search(s):
        # advance past the descendants of the element about to be dropped
        for i in range(len(element.findall('.//*'))):
            next(allElements)
        element.drop_tree()

print(tostring(page.body))
This outputs:
<body>
<div>
<p class="yeswanted">This content should stay</p>
</div>
<div>
<p class="yeswanted">This content should also stay</p>
</div>
</body>
This feels like a nasty hack - is there a more sensible way to achieve this using the library?
To simplify things you can use lxml's support for regular expressions within an XPath to find and kill the unwanted nodes without needing to iterate over all descendants.
This produces the same result as your script:
import lxml.html

EXSLT_NS = 'http://exslt.org/regular-expressions'
XPATH = r"//*[re:test(@class, '\bunwanted\b') or re:test(@id, '\bunwanted\b')]"

tree = lxml.html.fromstring(html)
for node in tree.xpath(XPATH, namespaces={'re': EXSLT_NS}):
    node.drop_tree()
print(lxml.html.tostring(tree.body))
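If the EXSLT extension is not available, a plain-XPath variant of the same match (a sketch, assuming class is a space-separated list and that id must equal "unwanted" exactly):
XPATH = ("//*[contains(concat(' ', normalize-space(@class), ' '), ' unwanted ')"
         " or @id='unwanted']")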
