BeautifulSoup (bs4) parsing wrong - python

Parsing this sample document with bs4, from python 2.7.6:
<html>
<body>
<p>HTML allows omitting P end-tags.
<p>Like that and this.
<p>And this, too.
<p>What happened?</p>
<p>And can we <p>nest a paragraph, too?</p></p>
</body>
</html>
Using:
from bs4 import BeautifulSoup as BS
...
tree = BS(fh)
HTML has, for ages, allowed omitted end-tags for various element types, including P (check the schema, or a parser). However, bs4's prettify() on this document shows that it doesn't end any of those paragraphs until it sees </body>:
<html>
<body>
<p>
HTML allows omitting P end-tags.
<p>
Like that and this.
<p>
And this, too.
<p>
What happened?
</p>
<p>
And can we
<p>
nest a paragraph, too?
</p>
</p>
</p>
</p>
</p>
</body>
It's not prettify()'s fault, because traversing the tree manually I get the same structure:
<[document]>
<html>
␊
<body>
␊
<p>
HTML allows omitting P end-tags.␊␊
<p>
Like that and this.␊␊
<p>
And this, too.␊␊
<p>
What happened?
</p>
␊
<p>
And can we
<p>
nest a paragraph, too?
</p>
</p>
␊
</p>
</p>
</p>
</body>
␊
</html>
␊
</[document]>
Now, this would be the right result for XML (at least up to </body>, at which point it should report a WF error). But this ain't XML. What gives?

The doc at http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser tells how to get BS4 to use different parsers. The default appears to be html.parser, which the BS4 doc says is broken before Python 2.7.3, but it evidently still has the problem described above in 2.7.6.
Switching to "lxml" was unsuccessful for me, but switching to "html5lib" produces the correct result:
tree = BS(htmSource, "html5lib")
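One way to see where the nesting comes from is to look at what the standard-library parser reports on its own. The following is a minimal sketch, assuming Python 3 (where the module is html.parser rather than Python 2's HTMLParser); the class name EventLogger is made up for illustration. It suggests that html.parser only emits start/end/data events and never synthesizes the implied </p>, so a tree builder layered on top of it has to apply HTML's omitted-end-tag rules itself.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Collect the raw parse events html.parser emits, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Ignore whitespace-only text so the event list stays readable.
        if data.strip():
            self.events.append(("data", data.strip()))


logger = EventLogger()
logger.feed("<p>One<p>Two</p>")
logger.close()
for event in logger.events:
    print(event)
```

No ("end", "p") event appears before the second ("start", "p"), which is consistent with the nested paragraphs shown above. html5lib, by contrast, implements the HTML5 parsing algorithm, which closes an open <p> when another one starts.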

Related

Add missing paragraph tags to HTML

I'm processing some medium fancy HTML pages to convert to simpler XHTML ones. The source pages have several divs (that I'm removing), that contain text not inside <p> tags. I need to add these <p> tags.
Here is a minimal example of the source page
<!DOCTYPE html>
<html>
<body>
<p>Hello world!</p>
<div style="font-weight: bold;">
This is a sample page
<br/>
Lots of things to learn!
<p>And lots to test</p>
</div>
<p>Enough with the sample code</p>
</body>
</html>
I want to convert it to
<!DOCTYPE html>
<html>
<body>
<p>Hello world!</p>
<p>This is a sample page</p>
<p>Lots of things to learn!</p>
<p>And lots to test</p>
<p>Enough with the sample code</p>
</body>
</html>
I'm developing a Python script using BeautifulSoup4 to do all of this, and I'm stuck at this step. Locating the text to wrap in <p> tags looks more like a regex job, with the result then passed to BeautifulSoup4. What do you think is the best approach to the problem?
I've scanned several pages and have seen this stray text at the start of divs, but I can't rule out that more of it appears in random places on other pages (i.e. a script that only checks the start of divs probably won't be reliable).
Note the <br/> tags, which have to be used to split the text into separate <p> paragraphs.
This script removes every tag inside <body> except <p>, then wraps each remaining piece of loose text in a new <p>, so the text on either side of a <br/> becomes its own paragraph:
from bs4 import BeautifulSoup
txt = '''<!DOCTYPE html>
<html>
<body>
<p>Hello world!</p>
<div style="font-weight: bold;">
This is a sample page
<br/>
Lots of things to learn!
<p>And lots to test</p>
</div>
<p>Enough with the sample code</p>
</body>
</html>'''
soup = BeautifulSoup(txt, 'html.parser')
for tag in soup.body.find_all(lambda tag: tag.name != 'p'):
    tag.unwrap()
for txt in soup.body.find_all(text=True):
    if txt.find_parent('p') or txt.strip() == '':
        continue
    txt.wrap(soup.new_tag("p"))
print(soup.prettify())
Prints:
<!DOCTYPE html>
<html>
<body>
<p>
Hello world!
</p>
<p>
This is a sample page
</p>
<p>
Lots of things to learn!
</p>
<p>
And lots to test
</p>
<p>
Enough with the sample code
</p>
</body>
</html>

How can I capture HTML, unmolested by the capturing library?

Is there a Python library out there that will let me get an arbitrary HTML snippet without molesting the markup? As far as I can tell, lxml, BeautifulSoup, and pyquery all make it easy to do something like soup.find(".arbitrary-class"), but the HTML it returns is reformatted. I want the raw, original markup.
So for example, say I have this:
<html>
<head>
<title>test</title>
</head>
<body>
<div class="arbitrary-class">
This is some<br />
markup with <br>
<p>some potentially problematic</p>
stuff in it <input type="text" name="w00t">
</div>
</body>
</html>
I want to capture exactly:
"
This is some<br />
markup with <br>
<p>some potentially problematic</p>
stuff in it <input type="text" name="w00t">
"
...spaces and all, and without mangling the tags to be properly formatted (as <br /> for example).
The trouble, it seems, is that all 3 libraries construct the DOM internally and simply return a Python object representing what the file should be rather than what it is, so I don't know where/how to get the original code snippet I need.
This code:
from bs4 import BeautifulSoup
with open("index.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")
print(soup.select(".arbitrary-class")[0].contents)
will return you the list:
[u'\n This is some', <br/>, u'\n markup with ', <br/>, u'\n', <p>some potentially problematic</p>, u'\n stuff in it ', <input name="w00t" type="text"/>, u'\n']
EDIT:
As Daniel noted in the comments, this results in normalized tags.
The only alternative I can find is to use a parser generator, such as pyparsing. The code below is a slight modification to some of their example code for the withAttribute function.
from pyparsing import *
html = """<html>
<head>
<title>test</title>
</head>
<body>
<div class="arbitrary-class">
This is some<br />
markup with <br>
<p>some potentially problematic</p>
stuff in it <input type="text" name="w00t">
</div>
</body>
</html>"""
div,div_end = makeHTMLTags("div")
# only match div tag having a class attribute with value "arbitrary-class"
div_grid = div().setParseAction(withClass("arbitrary-class"))
grid_expr = div_grid + SkipTo(div | div_end)("body")
for grid_header in grid_expr.searchString(html):
    print(repr(grid_header.body))
The output from this code is as follows:
'\n This is some<br />\n markup with <br>\n <p>some potentially problematic</p>\n stuff in it <input type="text" name="w00t">'
Note that the first <br/> now has a space, and the <input> tag no longer has an added / before the closing >. The only difference from your specification is that the trailing white space is missing. You might be able to resolve this difference by refining this solution.
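Another possibility, sketched here under assumptions rather than offered as a drop-in answer, is to use a parser only to locate the element and then slice the original string. The sketch assumes Python 3's standard-library html.parser, a <div> target, and source text that is well enough formed for div start and end tags to pair up; DivSpanFinder and raw_inner_html are hypothetical names. It relies on getpos() and get_starttag_text(), which report where a tag sits in the input and how it was literally written.

```python
from html.parser import HTMLParser


class DivSpanFinder(HTMLParser):
    """Locate the first <div> with a given class in the source text."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0         # open <div> levels, counting the target itself
        self.start = None      # (line, column) where the target's start tag begins
        self.start_len = 0     # length of the start tag exactly as written
        self.end = None        # (line, column) where the matching </div> begins

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.end is not None:
            return
        if self.depth:
            self.depth += 1    # a nested <div> inside the target
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.start = self.getpos()
            self.start_len = len(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.end = self.getpos()


def raw_inner_html(source, class_name):
    """Return the unmodified text between the target <div> and its </div>."""
    finder = DivSpanFinder(class_name)
    finder.feed(source)
    finder.close()
    if finder.start is None or finder.end is None:
        return None

    # getpos() reports (1-based line, 0-based column); turn that into an index.
    line_starts = [0]
    for index, char in enumerate(source):
        if char == "\n":
            line_starts.append(index + 1)

    def to_index(position):
        line, column = position
        return line_starts[line - 1] + column

    return source[to_index(finder.start) + finder.start_len:to_index(finder.end)]


SAMPLE = """<html>
<body>
<div class="arbitrary-class">
This is some<br />
markup with <br>
<p>some potentially problematic</p>
stuff in it <input type="text" name="w00t">
</div>
</body>
</html>"""

print(repr(raw_inner_html(SAMPLE, "arbitrary-class")))
```

Because the result is a slice of the input string, nothing is re-serialized: <br /> and <br> stay as written and the trailing whitespace before </div> is kept. The trade-off is that the sketch only counts <div> tags, so badly unbalanced markup would give a wrong span.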

BeautifulSoup lxml parser closing tags where it shouldn't be

I'm using BeautifulSoup's lxml parser to parse some html. However, it's not being parsed as it's written. For instance, the following code:
import bs4
my_html = '''
<html>
<body>
<B>
<P>
Hello, I am some bolded text
</P>
</B>
</body>
</html>
'''
soup = bs4.BeautifulSoup(my_html, 'lxml')
print(soup.prettify())
will print:
<html>
<body>
<b>
</b>
<p>
Hello, I am some bolded text
</p>
</body>
</html>
You can see that somehow the <B> tag from my_html gets closed off before the <p> tag in the prettified version, even though it should be closed off after the </p>. Any ideas about what might be going on? I'm totally baffled.
That's because paragraphs are not allowed inside the <b> tag.
Only tags that accept flow content are allowed as the parent of <p> tags. See here for a list.
However, you can do the reverse; <p> is allowed as the parent for <b> tags. In your case, you can change your raw HTML to something like this:
my_html = '''
<html>
<body>
<p>
<b>
Hello, I am some bolded text
</b>
</p>
</body>
</html>
'''
This is because you can't have a <p> tag inside of a <b> tag, so the parser is trying to fix broken HTML. Using the html5lib parser or Python's html.parser will result in your expected output (I only know this because I just tested it).

double encoded html code

I use Xinha as a WYSIWYG editor for HTML content.
I send HTML articles via a POST form to PostgreSQL.
So far so good, they seem OK.
But when I read them back from PostgreSQL and output them to an HTML page, I see double-encoded, i.e. broken, HTML code
like this
<p><a href="http://google.com">google.com</a></p> <p> </p> <p>
Any idea on where to search for the issue?
Thanks in advance
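import HTMLParser  # Python 2 module name
hp = HTMLParser.HTMLParser()
# Entity-encoded markup as it comes back from the database (reconstructed sample;
# single quotes outside so the href can keep its double quotes)
s = '&lt;p&gt;&lt;a href="http://google.com"&gt;google.com&lt;/a&gt;&lt;/p&gt; &lt;p&gt; &lt;/p&gt; &lt;p&gt;'
print(hp.unescape(s))
# <p><a href="http://google.com">google.com</a></p> <p> </p> <p>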
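For readers on Python 3, the same decoding is available as html.unescape in the standard library. The sketch below is illustrative only: unescape_fully is a hypothetical helper, and the sample strings are made up. It shows that text escaped twice needs two decoding passes, which may help in tracking down which layer (editor, form handler, or template) escapes the content a second time.

```python
import html


def unescape_fully(text, max_rounds=5):
    """Repeat html.unescape until the text stops changing."""
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text


once = "&lt;p&gt;google.com&lt;/p&gt;"
twice = "&amp;lt;p&amp;gt;google.com&amp;lt;/p&amp;gt;"

print(html.unescape(once))    # one pass is enough here
print(html.unescape(twice))   # still entity-encoded after one pass
print(unescape_fully(twice))  # decoded after a second pass
```

Decoding repeatedly is a workaround, not a fix: it would also decode entities that an author typed deliberately, so finding and removing the extra escaping step is probably the safer route.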

Editing tree in place while iterating in lxml

I am using lxml to parse html and edit it to produce a new document. Essentially, I'm trying to use it somewhat like the javascript DOM - I know this is not really the intended use, but much of it works well so far.
Currently, I use iterdescendants() to get an iterator over the elements and then deal with each in turn.
However, if an element is dropped during the iteration, its children are still considered, since the dropping does not affect the iteration, as you would expect. In order to get the results I want, this hack works:
from lxml.html import fromstring, tostring
import re
html = '''
<html>
<head>
</head>
<body>
<div>
<p class="unwanted">This content should go</p>
<p class="fine">This content should stay</p>
</div>
<div id = "second" class="unwanted">
<p class = "alreadydead">This content should not be looked at</p>
<p class = "alreadydead">Nor should this</p>
<div class="alreadydead">
<p class="alreadydead">Still dead</p>
</div>
</div>
<div>
<p class="yeswanted">This content should also stay</p>
</div>
</body>
</html>
'''
page = fromstring(html)
allElements = page.iterdescendants()
for element in allElements:
    s = "%s%s" % (element.get('class', ''), element.get('id', ''))
    if re.compile('unwanted').search(s):
        # advance the iterator past the descendants of the element being dropped
        for i in range(len(element.findall('.//*'))):
            next(allElements)
        element.drop_tree()
print(tostring(page.body))
This outputs:
<body>
<div>
<p class="fine">This content should stay</p>
</div>
<div>
<p class="yeswanted">This content should also stay</p>
</div>
</body>
This feels like a nasty hack - is there a more sensible way to achieve this using the library?
To simplify things you can use lxml's support for regular expressions within an XPath to find and kill the unwanted nodes without needing to iterate over all descendants.
This produces the same result as your script:
import lxml.html
EXSLT_NS = 'http://exslt.org/regular-expressions'
# @class and @id select the attributes; re:test applies the regex to their values
XPATH = r"//*[re:test(@class, '\bunwanted\b') or re:test(@id, '\bunwanted\b')]"
tree = lxml.html.fromstring(html)
for node in tree.xpath(XPATH, namespaces={'re': EXSLT_NS}):
    node.drop_tree()
print(lxml.html.tostring(tree.body))
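The underlying idea (decide what to remove from a snapshot, and never descend into something already removed) can also be written as a short recursion. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without extra packages; the sample document is a shortened, well-formed version of the one in the question, and prune is a hypothetical helper name. With lxml the same shape should work by calling child.drop_tree() instead of parent.remove(child).

```python
import re
import xml.etree.ElementTree as ET

DOC = """<body>
<div><p class="unwanted">go</p><p class="fine">stay</p></div>
<div id="second" class="unwanted"><p class="alreadydead">never visited</p></div>
<div><p class="yeswanted">also stay</p></div>
</body>"""

UNWANTED = re.compile(r"\bunwanted\b")


def prune(parent):
    """Remove unwanted children of parent; recurse only into survivors."""
    for child in list(parent):          # iterate over a copy, mutate the original
        label = child.get("class", "") + " " + child.get("id", "")
        if UNWANTED.search(label):
            parent.remove(child)        # the whole subtree goes with it
        else:
            prune(child)


root = ET.fromstring(DOC)
prune(root)
print(ET.tostring(root, encoding="unicode"))
```

One difference to keep in mind: ElementTree's remove() also discards the removed element's tail text, whereas lxml's drop_tree() keeps it, so the two are not interchangeable for mixed content.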
